Experiment Design
Design rigorous A/B tests and product experiments — defining hypotheses, choosing metrics, calculating sample sizes, setting stopping rules, and writing analysis plans that avoid common statistical pitfalls.
Tags: a-b-testing, experiments, hypothesis, sample-size, statistics, product-analytics
$ npx skills add The-AI-Directory-Company/(…) --skill experiment-design
SKILL.md
# Experiment Design

## Before you start

Gather the following from the user. If anything is missing, ask before proceeding:

1. **What change are you testing?** (UI change, algorithm tweak, pricing model, new feature rollout)
2. **What outcome do you expect?** (Increase conversion, reduce churn, improve engagement — be specific)
3. **Who is the target population?** (All users, a segment, new users only, specific market)
4. **What is the current baseline?** (Current conversion rate, average revenue, retention rate — with approximate numbers)
5. **What is the minimum detectable effect (MDE)?** (Smallest improvement worth detecting — e.g., +2pp conversion, +5% revenue)
6. **What is the timeline?** (How long can the experiment run before a decision is needed?)
7. **Are there any constraints?** (Traffic volume, seasonality, regulatory requirements, shared infrastructure)

## Experiment design template

### 1. Hypothesis

State a falsifiable hypothesis in this format:

```
If we [change], then [metric] will [direction] by at least [MDE],
because [reasoning based on user behavior or data].
```

A hypothesis without a mechanism ("because") is a guess. The mechanism forces you to articulate why the change should work, which informs metric selection and interpretation.

### 2. Primary Metric + Guardrail Metrics

**Primary metric:** One metric that decides the experiment. Exactly one — not two, not "primary and secondary." If you cannot pick one, you do not understand the goal yet.

**Guardrail metrics:** 2-4 metrics that must not degrade. These protect against winning on the primary metric at the cost of something else.

| Role | Metric | Current Baseline | MDE | Direction |
|------|--------|-----------------|-----|-----------|
| Primary | Checkout conversion rate | 3.2% | +0.5pp | Increase |
| Guardrail | Revenue per user | $12.40 | -$0.50 | Must not decrease |
| Guardrail | Page load time (p95) | 1.8s | +200ms | Must not increase |
| Guardrail | Support ticket rate | 0.4% | +0.1pp | Must not increase |

### 3. Sample Size Calculation

Specify the inputs and the result:

```
Baseline rate: 3.2%
Minimum detectable effect: +0.5pp (absolute) → 3.7%
Significance level (alpha): 0.05 (two-sided)
Power (1 - beta): 0.80
Sample size per variant: ~20,900 users
Total sample: ~41,800 users
```

State the formula or tool used (e.g., Evan Miller's calculator, statsmodels power analysis). If using a ratio metric or non-binomial outcome, note the test type (t-test, Mann-Whitney, etc.).
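
The inputs above plug into the standard unpooled two-proportion formula. A minimal sketch, assuming scipy is available (the function name is illustrative):

```python
import math

from scipy.stats import norm


def two_proportion_sample_size(baseline, mde_abs, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided two-proportion z-test
    (unpooled-variance normal approximation)."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05, two-sided
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


n = two_proportion_sample_size(0.032, 0.005)
print(f"per variant: {n:,}, total: {2 * n:,}")  # ~20,900 per variant
```

Results vary by a few percent across tools (pooled vs. unpooled variance, continuity corrections), so treat the output as a planning estimate, not a contract.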

### 4. Randomization Unit

Define what gets randomized and how:

- **Unit:** User-level (most common), session-level, device-level, or cluster-level
- **Method:** Hash-based assignment (deterministic) vs. random draw (non-deterministic)
- **Stickiness:** Users must stay in the same variant across sessions. Specify how (user ID hash, cookie, backend assignment table)

Flag risks: if randomizing at user level but the feature affects shared resources (e.g., marketplace supply), consider cluster or switchback designs.
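
Deterministic, sticky assignment can be as simple as hashing the user ID with an experiment-specific salt. A sketch (names are illustrative):

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministic hash-based assignment: the same user always lands
    in the same variant, with no assignment table to maintain."""
    # Salting with the experiment name decorrelates the splits of
    # different experiments running on the same user base.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# Stickiness: repeated calls agree across sessions and servers.
assert assign_variant("user-42", "checkout-v2") == assign_variant("user-42", "checkout-v2")
```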

### 5. Runtime Estimation

```
Daily eligible traffic: ~4,200 users
Sample needed: 41,800 users
Estimated runtime: 10 days (to reach sample size)
Recommended minimum: 14 days (to capture weekly seasonality)
Maximum runtime: 28 days (to avoid novelty effect decay)
```

Always round up to full weeks to account for day-of-week effects. If runtime exceeds 4 weeks, revisit the MDE — you may be trying to detect an effect too small to matter.
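
The runtime arithmetic, including the round-up-to-full-weeks rule, as a minimal sketch (numbers match the example above):

```python
import math


def runtime_days(total_sample: int, daily_traffic: int) -> int:
    """Days to reach the target sample, rounded up to full weeks so
    every day of the week is represented equally."""
    days = math.ceil(total_sample / daily_traffic)
    return math.ceil(days / 7) * 7


print(runtime_days(41_800, 4_200))  # 10 raw days -> 14 days
```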

### 6. Stopping Rules

Define in advance when and how the experiment ends:

- **Do not peek** at results before the planned sample size unless using a sequential testing framework (e.g., group sequential design, always-valid p-values)
- **Stop early for harm:** If a guardrail metric degrades beyond a pre-defined threshold (e.g., revenue drops > 5%), stop the experiment regardless of primary metric
- **No early stopping for success** under fixed-horizon testing — a significant p-value at 40% of the sample does not mean the effect is real
- **If using sequential testing:** Specify the spending function (O'Brien-Fleming, Pocock) and planned interim analysis points
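
The "do not peek" rule is easy to verify by simulation: run A/A tests (no true effect) and compare the false positive rate of a single fixed-horizon look against checking every day. A sketch assuming numpy; the simulation parameters are illustrative:

```python
import numpy as np

# A/A simulation: both arms share the same true conversion rate, so any
# "significant" result is a false positive by construction.
rng = np.random.default_rng(7)
n_sims, n_days, daily_n, rate = 2000, 20, 500, 0.05

a = rng.binomial(daily_n, rate, size=(n_sims, n_days)).cumsum(axis=1)
b = rng.binomial(daily_n, rate, size=(n_sims, n_days)).cumsum(axis=1)
n = daily_n * np.arange(1, n_days + 1)         # cumulative sample per arm
pooled = (a + b) / (2 * n)
se = np.sqrt(pooled * (1 - pooled) * 2 / n)    # pooled two-proportion SE
z = (a / n - b / n) / se

fixed_horizon = np.mean(np.abs(z[:, -1]) > 1.96)         # one look, at the end
daily_peeking = np.mean((np.abs(z) > 1.96).any(axis=1))  # stop at first "hit"
print(f"fixed horizon: {fixed_horizon:.3f}")  # close to the nominal 0.05
print(f"daily peeking: {daily_peeking:.3f}")  # roughly 0.2-0.3
```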

### 7. Holdout Groups

When the experiment will lead to a permanent rollout, reserve a holdout:

- **Size:** 5-10% of eligible traffic, withheld from the winning variant after rollout
- **Purpose:** Measure long-term impact, detect novelty effects wearing off, validate the experiment result in production
- **Duration:** Minimum 4 weeks post-rollout, ideally one full business cycle

If no holdout is planned, document why (e.g., regulatory requirement to treat all users equally).

### 8. Analysis Plan

Write this before the experiment starts — never after seeing results:

1. **Primary analysis:** Compare variant vs. control on the primary metric using [test type]. Report the point estimate, 95% confidence interval, and p-value.
2. **Guardrail checks:** For each guardrail, confirm the metric did not degrade beyond the threshold. Use one-sided tests where appropriate.
3. **Segmentation:** Pre-register 2-3 subgroup analyses (e.g., new vs. returning users, mobile vs. desktop). Segments chosen after seeing results are exploratory, not confirmatory.
4. **Multiple comparisons:** If running more than two variants, apply Bonferroni or Holm correction. State the adjusted alpha.
5. **Sample Ratio Mismatch (SRM) check:** Verify the actual split matches the intended ratio (chi-square test, p < 0.001 threshold). SRM invalidates the experiment.
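
The SRM check in step 5 is a one-line chi-square goodness-of-fit test. A sketch assuming scipy; the counts are illustrative:

```python
from scipy.stats import chisquare


def srm_pvalue(observed_counts, intended_ratio):
    """Chi-square goodness-of-fit p-value for Sample Ratio Mismatch.
    p < 0.001 means the split is broken and the experiment is invalid."""
    total = sum(observed_counts)
    expected = [r * total for r in intended_ratio]
    return chisquare(observed_counts, f_exp=expected).pvalue


print(srm_pvalue([10_078, 9_922], [0.5, 0.5]))  # ~0.27: no mismatch
print(srm_pvalue([10_300, 9_700], [0.5, 0.5]))  # ~2e-5: stop and debug
```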

### 9. Reporting Template

```
Experiment: [Name]
Dates: [Start] – [End]
Variants: Control (50%) vs. Treatment (50%)
Total users: [N]

Primary metric: Checkout conversion
Control: 3.18% (n = 20,950)
Treatment: 3.71% (n = 20,850)
Difference: +0.53pp (+16.7% relative)
95% CI: [+0.18pp, +0.88pp]
p-value: 0.003

Guardrails: All passed (see appendix)
SRM check: p = 0.62 (no mismatch)
Decision: SHIP / ITERATE / KILL
Rationale: [1-2 sentences]
```
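
The point estimate, CI, and p-value in a report like this come from a two-proportion z-test. A sketch of the computation, assuming scipy (unpooled Wald interval; the counts are illustrative):

```python
from scipy.stats import norm


def two_proportion_test(x_c, n_c, x_t, n_t, conf=0.95):
    """Difference in proportions with a Wald CI and two-sided p-value."""
    p_c, p_t = x_c / n_c, x_t / n_t
    diff = p_t - p_c
    se = (p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t) ** 0.5
    p_value = 2 * norm.sf(abs(diff / se))
    half = norm.ppf(0.5 + conf / 2) * se
    return diff, (diff - half, diff + half), p_value


diff, ci, p = two_proportion_test(x_c=666, n_c=20_950, x_t=773, n_t=20_850)
print(f"+{diff * 100:.2f}pp, 95% CI [{ci[0] * 100:+.2f}pp, "
      f"{ci[1] * 100:+.2f}pp], p = {p:.3f}")
# -> +0.53pp, 95% CI [+0.18pp, +0.88pp], p = 0.003
```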

## Quality checklist

Before delivering an experiment design, verify:

- [ ] Hypothesis includes a falsifiable prediction and a causal mechanism
- [ ] Exactly one primary metric is defined — not a composite or a list
- [ ] Guardrail metrics cover revenue, performance, and user experience
- [ ] Sample size calculation includes all inputs (baseline, MDE, alpha, power)
- [ ] Randomization unit matches the unit of analysis (no user-level randomization with session-level metrics without correction)
- [ ] Runtime accounts for weekly seasonality (full-week increments)
- [ ] Stopping rules are defined before the experiment starts, not improvised mid-flight
- [ ] Analysis plan is pre-registered — subgroups and corrections specified in advance
- [ ] SRM check is included in the analysis plan

## Common mistakes to avoid

- **Peeking at results.** Checking significance daily and stopping when p < 0.05 inflates false positive rates to 20-30%. Use sequential testing if you need interim looks, or commit to a fixed horizon.
- **Underpowered tests.** Running an experiment "for a week" regardless of traffic. If you do not have enough sample to detect your MDE, the experiment will almost certainly show "no significant difference" — and you will learn nothing.
- **Multiple comparisons without correction.** Testing 5 variants against control at alpha = 0.05 gives a ~23% chance of at least one false positive. Apply Bonferroni (alpha / k) or use a hierarchical testing procedure.
- **Novelty effects.** New UI elements get extra attention simply for being new. If you measure a lift in week 1, it may vanish by week 3. Run experiments long enough and use holdout groups to validate durability.
- **Post-hoc segmentation as proof.** "It didn't win overall, but it won for mobile users in Germany" is not a valid conclusion — it is hypothesis generation. Pre-register segments or label post-hoc findings as exploratory.
- **Ignoring SRM.** If your 50/50 split is actually 51/49, something in the assignment or logging pipeline is broken. No amount of statistical analysis can fix corrupted randomization.
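
The ~23% figure in the multiple-comparisons bullet is just 1 − 0.95⁵. A sketch of that arithmetic plus a plain Holm step-down correction (the p-values are illustrative):

```python
# Familywise error rate: k independent tests, each at alpha = 0.05.
k = 5
print(f"{1 - (1 - 0.05) ** k:.1%}")  # 22.6% chance of >= 1 false positive


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: uniformly more powerful than plain Bonferroni,
    still controls the familywise error rate at alpha."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    reject = [False] * len(p_values)
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (len(p_values) - rank):
            break  # once one test fails, every larger p-value fails too
        reject[i] = True
    return reject


print(holm_reject([0.003, 0.04, 0.02, 0.30, 0.011]))
# -> [True, False, False, False, True]
```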
| 151 |