Name: Ab Test Setup
Author: Coreyhaines31

Install

$ npx skills add https://github.com/coreyhaines31/marketingskills --skill ab-test-setup


Source file

SKILL.md (353 lines, markdown)


---
name: ab-test-setup
description: When the user wants to plan, design, or implement an A/B test or experiment, or build a growth experimentation program. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," "hypothesis," "should I test this," "which version is better," "test two versions," "statistical significance," "how long should I run this test," "growth experiments," "experiment velocity," "experiment backlog," "ICE score," "experimentation program," or "experiment playbook." Use this whenever someone is comparing two approaches and wants to measure which performs better, or when they want to build a systematic experimentation practice. For tracking implementation, see analytics-tracking. For page-level conversion optimization, see page-cro.
metadata:
  version: 1.2.0
---

# A/B Test Setup

You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.

## Initial Assessment

**Check for product marketing context first:**
If `.agents/product-marketing-context.md` exists (or `.claude/product-marketing-context.md` in older setups), read it before asking questions. Use that context and only ask for information not already covered or specific to this task.

Before designing a test, understand:

1. **Test Context** - What are you trying to improve? What change are you considering?
2. **Current State** - Baseline conversion rate? Current traffic volume?
3. **Constraints** - Technical complexity? Timeline? Tools available?

---

## Core Principles

### 1. Start with a Hypothesis
- Not just "let's see what happens"
- Specific prediction of outcome
- Based on reasoning or data

### 2. Test One Thing
- Single variable per test
- Otherwise you don't know what worked

### 3. Statistical Rigor
- Pre-determine sample size
- Don't peek and stop early
- Commit to the methodology

### 4. Measure What Matters
- Primary metric tied to business value
- Secondary metrics for context
- Guardrail metrics to prevent harm

---

## Hypothesis Framework

### Structure

```
Because [observation/data],
we believe [change]
will cause [expected outcome]
for [audience].
We'll know this is true when [metrics].
```

### Example

**Weak**: "Changing the button color might increase clicks."

**Strong**: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."

---

## Test Types

| Type | Description | Traffic Needed |
|------|-------------|----------------|
| A/B | Two versions, single change | Moderate |
| A/B/n | Multiple variants | Higher |
| MVT | Multiple changes in combinations | Very high |
| Split URL | Different URLs for variants | Moderate |

---

## Sample Size

### Quick Reference

| Baseline | 10% Lift | 20% Lift | 50% Lift |
|----------|----------|----------|----------|
| 1% | 150k/variant | 39k/variant | 6k/variant |
| 3% | 47k/variant | 12k/variant | 2k/variant |
| 5% | 27k/variant | 7k/variant | 1.2k/variant |
| 10% | 12k/variant | 3k/variant | 550/variant |

**Calculators:**
- [Evan Miller's](https://www.evanmiller.org/ab-testing/sample-size.html)
- [Optimizely's](https://www.optimizely.com/sample-size-calculator/)

**For detailed sample size tables and duration calculations**: See [references/sample-size-guide.md](references/sample-size-guide.md)

---

## Metrics Selection

### Primary Metric
- Single metric that matters most
- Directly tied to hypothesis
- What you'll use to call the test

### Secondary Metrics
- Support primary metric interpretation
- Explain why/how the change worked

### Guardrail Metrics
- Things that shouldn't get worse
- Stop test if significantly negative

### Example: Pricing Page Test
- **Primary**: Plan selection rate
- **Secondary**: Time on page, plan distribution
- **Guardrail**: Support tickets, refund rate

---

## Designing Variants

### What to Vary

| Category | Examples |
|----------|----------|
| Headlines/Copy | Message angle, value prop, specificity, tone |
| Visual Design | Layout, color, images, hierarchy |
| CTA | Button copy, size, placement, number |
| Content | Information included, order, amount, social proof |

### Best Practices
- Single, meaningful change
- Bold enough to make a difference
- True to the hypothesis

---

## Traffic Allocation

| Approach | Split | When to Use |
|----------|-------|-------------|
| Standard | 50/50 | Default for A/B |
| Conservative | 90/10, 80/20 | Limit risk of bad variant |
| Ramping | Start small, increase | Technical risk mitigation |

**Considerations:**
- Consistency: Users see same variant on return
- Balanced exposure across time of day/week

---

## Implementation

### Client-Side
- JavaScript modifies page after load
- Quick to implement, can cause flicker
- Tools: PostHog, Optimizely, VWO

### Server-Side
- Variant determined before render
- No flicker, requires dev work
- Tools: PostHog, LaunchDarkly, Split

---

## Running the Test

### Pre-Launch Checklist
- [ ] Hypothesis documented
- [ ] Primary metric defined
- [ ] Sample size calculated
- [ ] Variants implemented correctly
- [ ] Tracking verified
- [ ] QA completed on all variants

### During the Test

**DO:**
- Monitor for technical issues
- Check segment quality
- Document external factors

**Avoid:**
- Peeking at results and stopping early
- Making changes to variants
- Adding traffic from new sources

### The Peeking Problem
Looking at results before reaching sample size and stopping early leads to false positives and wrong decisions. Pre-commit to sample size and trust the process.

---

## Analyzing Results

### Statistical Significance
- 95% confidence = p-value < 0.05
- Means that, if there were no real difference, a result at least this extreme would occur less than 5% of the time
- Not a guarantee, just a threshold

### Analysis Checklist

1. **Reach sample size?** If not, result is preliminary
2. **Statistically significant?** Check confidence intervals
3. **Effect size meaningful?** Compare to MDE, project impact
4. **Secondary metrics consistent?** Support the primary?
5. **Guardrail concerns?** Anything get worse?
6. **Segment differences?** Mobile vs. desktop? New vs. returning?

### Interpreting Results

| Result | Conclusion |
|--------|------------|
| Significant winner | Implement variant |
| Significant loser | Keep control, learn why |
| No significant difference | Need more traffic or bolder test |
| Mixed signals | Dig deeper, maybe segment |

---

## Documentation

Document every test with:
- Hypothesis
- Variants (with screenshots)
- Results (sample, metrics, significance)
- Decision and learnings

**For templates**: See [references/test-templates.md](references/test-templates.md)

---

## Growth Experimentation Program

Individual tests are valuable. A continuous experimentation program is a compounding asset. This section covers how to run experiments as an ongoing growth engine, not just one-off tests.

### The Experiment Loop

```
1. Generate hypotheses (from data, research, competitors, customer feedback)
2. Prioritize with ICE scoring
3. Design and run the test
4. Analyze results with statistical rigor
5. Promote winners to a playbook
6. Generate new hypotheses from learnings
→ Repeat
```

### Hypothesis Generation

Feed your experiment backlog from multiple sources:

| Source | What to Look For |
|--------|-----------------|
| Analytics | Drop-off points, low-converting pages, underperforming segments |
| Customer research | Pain points, confusion, unmet expectations |
| Competitor analysis | Features, messaging, or UX patterns they use that you don't |
| Support tickets | Recurring questions or complaints about conversion flows |
| Heatmaps/recordings | Where users hesitate, rage-click, or abandon |
| Past experiments | "Significant loser" tests often reveal new angles to try |

### ICE Prioritization

Score each hypothesis 1-10 on three dimensions:

| Dimension | Question |
|-----------|----------|
| **Impact** | If this works, how much will it move the primary metric? |
| **Confidence** | How sure are we this will work? (Based on data, not gut.) |
| **Ease** | How fast and cheap can we ship and measure this? |

**ICE Score** = (Impact + Confidence + Ease) / 3

Run highest-scoring experiments first. Re-score monthly as context changes.

### Experiment Velocity

Track your experimentation rate as a leading indicator of growth:

| Metric | Target |
|--------|--------|
| Experiments launched per month | 4-8 for most teams |
| Win rate | 20-30% is common for mature programs (sustained higher rates may indicate conservative hypotheses) |
| Average test duration | 2-4 weeks |
| Backlog depth | 20+ hypotheses queued |
| Cumulative lift | Compound gains from all winners |

### The Experiment Playbook

When a test wins, don't just implement it; document the pattern:

```
## [Experiment Name]
**Date**: [date]
**Hypothesis**: [the hypothesis]
**Sample size**: [n per variant]
**Result**: [winner/loser/inconclusive]; [primary metric] changed by [X%] (95% CI: [range], p=[value])
**Guardrails**: [any guardrail metrics and their outcomes]
**Segment deltas**: [notable differences by device, segment, or cohort]
**Why it worked/failed**: [analysis]
**Pattern**: [the reusable insight, e.g., "social proof near pricing CTAs increases plan selection"]
**Apply to**: [other pages/flows where this pattern might work]
**Status**: [implemented / parked / needs follow-up test]
```

Over time, your playbook becomes a library of proven growth patterns specific to your product and audience.

### Experiment Cadence

**Weekly (30 min)**: Review running experiments for technical issues and guardrail metrics. Don't call winners early, but do stop tests where guardrails are significantly negative.

**Bi-weekly**: Conclude completed experiments. Analyze results, update playbook, launch next experiment from backlog.

**Monthly (1 hour)**: Review experiment velocity, win rate, cumulative lift. Replenish hypothesis backlog. Re-prioritize with ICE.

**Quarterly**: Audit the playbook. Which patterns have been applied broadly? Which winning patterns haven't been scaled yet? What areas of the funnel are under-tested?

---

## Common Mistakes

### Test Design
- Testing too small a change (undetectable)
- Testing too many things (can't isolate)
- No clear hypothesis

### Execution
- Stopping early
- Changing things mid-test
- Not checking implementation

### Analysis
- Ignoring confidence intervals
- Cherry-picking segments
- Over-interpreting inconclusive results

---

## Task-Specific Questions

1. What's your current conversion rate?
2. How much traffic does this page get?
3. What change are you considering and why?
4. What's the smallest improvement worth detecting?
5. What tools do you have for testing?
6. Have you tested this area before?

---

## Related Skills

- **page-cro**: For generating test ideas based on CRO principles
- **analytics-tracking**: For setting up test measurement
- **copywriting**: For creating variant copy
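As a rough companion to the Sample Size quick-reference table, the sketch below estimates visitors per variant with the standard normal-approximation formula for comparing two proportions. It is an illustration, not the method behind the table or the linked calculators. The two-sided alpha of 0.05 mirrors the 95% confidence threshold used in this skill; the 80% power default, the function names, and the duration helper are assumptions added here, and the table's rounded figures need not match its output exactly.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided test.

    baseline      -- current conversion rate, e.g. 0.05 for 5%
    relative_lift -- smallest relative improvement to detect, e.g. 0.20 for +20%
    alpha, power  -- assumed defaults (5% significance, 80% power)
    """
    if not 0 < baseline < 1:
        raise ValueError("baseline must be between 0 and 1")
    if relative_lift <= 0:
        raise ValueError("relative_lift must be positive")

    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    if p2 >= 1:
        raise ValueError("baseline * (1 + relative_lift) must stay below 1")

    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power

    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def estimated_duration_days(n_per_variant, variants, daily_visitors):
    """Hypothetical helper: days to reach the sample, assuming steady traffic
    that is split evenly across all variants."""
    if daily_visitors <= 0 or variants < 2:
        raise ValueError("need positive traffic and at least two variants")
    return ceil(n_per_variant * variants / daily_visitors)


if __name__ == "__main__":
    n = sample_size_per_variant(baseline=0.05, relative_lift=0.20)
    days = estimated_duration_days(n, variants=2, daily_visitors=1500)
    print(f"{n} visitors per variant, about {days} days at 1,500 visitors/day")
```

Smaller lifts or lower baselines push the required sample up quickly, which is why the minimum improvement worth detecting should be agreed before launch.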

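The ICE Prioritization formula, (Impact + Confidence + Ease) / 3, and the advice to run the highest-scoring experiments first can be sketched as a small backlog sorter. The `Hypothesis` class, the `prioritize` function, and the sample backlog entries are hypothetical names and data made up for illustration; only the 1-10 scale and the averaging formula come from the skill text.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One backlog entry; each dimension is scored 1-10 as described above."""
    name: str
    impact: int
    confidence: int
    ease: int

    def ice_score(self) -> float:
        # Reject scores outside the 1-10 scale before averaging.
        for label, value in (
            ("impact", self.impact),
            ("confidence", self.confidence),
            ("ease", self.ease),
        ):
            if not 1 <= value <= 10:
                raise ValueError(f"{label} must be between 1 and 10, got {value}")
        return (self.impact + self.confidence + self.ease) / 3


def prioritize(backlog):
    """Return the backlog sorted so the highest ICE score comes first."""
    return sorted(backlog, key=lambda h: h.ice_score(), reverse=True)


if __name__ == "__main__":
    # Made-up example entries, not real experiment data.
    backlog = [
        Hypothesis("Shorter signup form", impact=7, confidence=6, ease=8),
        Hypothesis("Annual plan default", impact=9, confidence=4, ease=3),
        Hypothesis("Social proof near CTA", impact=6, confidence=7, ease=9),
    ]
    for h in prioritize(backlog):
        print(f"{h.ice_score():.1f}  {h.name}")
```

Re-running the sort after each monthly re-score keeps the queue aligned with current context.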