
Experiment Design

Design rigorous A/B tests and product experiments — defining hypotheses, choosing metrics, calculating sample sizes, setting stopping rules, and writing analysis plans that avoid common statistical pitfalls.

Tags: a-b-testing, experiments, hypothesis, sample-size, statistics, product-analytics

Works well with agents

AI Engineer Agent, Data Scientist Agent, Growth Engineer Agent, Pricing Strategist Agent, Product Analyst Agent, Product Operations Agent, Prompt Engineer Agent

Works well with skills

Metrics Framework, ML Model Evaluation, PRD Writing, Pricing Analysis, Prompt Engineering Guide
$ npx skills add The-AI-Directory-Company/(…) --skill experiment-design
experiment-design/
  • onboarding-flow-test.md (3.3 KB)
  • SKILL.md (8.7 KB)
SKILL.md

# Experiment Design

## Before you start

Gather the following from the user. If anything is missing, ask before proceeding:

1. **What change are you testing?** (UI change, algorithm tweak, pricing model, new feature rollout)
2. **What outcome do you expect?** (Increase conversion, reduce churn, improve engagement — be specific)
3. **Who is the target population?** (All users, a segment, new users only, specific market)
4. **What is the current baseline?** (Current conversion rate, average revenue, retention rate — with approximate numbers)
5. **What is the minimum detectable effect (MDE)?** (Smallest improvement worth detecting — e.g., +2pp conversion, +5% revenue)
6. **What is the timeline?** (How long can the experiment run before a decision is needed?)
7. **Are there any constraints?** (Traffic volume, seasonality, regulatory requirements, shared infrastructure)

## Experiment design template

### 1. Hypothesis

State a falsifiable hypothesis in this format:

```
If we [change], then [metric] will [direction] by at least [MDE],
because [reasoning based on user behavior or data].
```

A hypothesis without a mechanism ("because") is a guess. The mechanism forces you to articulate why the change should work, which informs metric selection and interpretation.

### 2. Primary Metric + Guardrail Metrics

**Primary metric:** One metric that decides the experiment. Exactly one — not two, not "primary and secondary." If you cannot pick one, you do not understand the goal yet.

**Guardrail metrics:** 2-4 metrics that must not degrade. These protect against winning on the primary metric at the cost of something else.

| Role | Metric | Current Baseline | MDE | Direction |
|------|--------|------------------|-----|-----------|
| Primary | Checkout conversion rate | 3.2% | +0.5pp | Increase |
| Guardrail | Revenue per user | $12.40 | -$0.50 | Must not decrease |
| Guardrail | Page load time (p95) | 1.8s | +200ms | Must not increase |
| Guardrail | Support ticket rate | 0.4% | +0.1pp | Must not increase |

### 3. Sample Size Calculation

Specify the inputs and the result:

```
Baseline rate: 3.2%
Minimum detectable effect: +0.5pp (absolute) → 3.7%
Significance level (alpha): 0.05 (two-sided)
Power (1 - beta): 0.80
Sample size per variant: ~14,750 users
Total sample: ~29,500 users
```

State the formula or tool used (e.g., Evan Miller's calculator, statsmodels power analysis). If using a ratio metric or non-binomial outcome, note the test type (t-test, Mann-Whitney, etc.).
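As a sanity check, a calculation like the one above can be run with statsmodels power analysis, one of the tools named. This is a minimal sketch, not the calculator behind the worked numbers; different tools use different approximations (an arcsine effect size here, pooled-variance formulas elsewhere), so per-variant counts can differ noticeably between them:

```python
# Sample-size sketch for a two-proportion test using statsmodels.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.032   # current checkout conversion rate
target = 0.037     # baseline + 0.5pp absolute MDE

effect = proportion_effectsize(target, baseline)  # Cohen's h (arcsine transform)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Per variant: ~{n_per_variant:,.0f}, total: ~{2 * n_per_variant:,.0f}")
```

Whichever tool you use, record its name and inputs in the design doc so the number can be reproduced.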

### 4. Randomization Unit

Define what gets randomized and how:

- **Unit:** User-level (most common), session-level, device-level, or cluster-level
- **Method:** Hash-based assignment (deterministic) vs. random draw (non-deterministic)
- **Stickiness:** Users must stay in the same variant across sessions. Specify how (user ID hash, cookie, backend assignment table)

Flag risks: if randomizing at user level but the feature affects shared resources (e.g., marketplace supply), consider cluster or switchback designs.
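Deterministic hash-based assignment with stickiness fits in a few lines; the function name and salting scheme below are illustrative, not part of the skill:

```python
import hashlib

def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically map a user to a variant: the same user_id and
    experiment name always yield the same bucket, across sessions."""
    # Salt with the experiment name so assignments are uncorrelated
    # across experiments running on the same user base.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100        # stable bucket in [0, 100)
    return "control" if bucket < 50 else "treatment"
```

Because assignment is a pure function of (user_id, experiment), no assignment table is needed, though you still need to log exposures for the SRM check.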

### 5. Runtime Estimation

```
Daily eligible traffic: ~4,200 users
Sample needed: 29,500 users
Estimated runtime: 8 days (to reach sample size)
Recommended minimum: 14 days (to capture weekly seasonality)
Maximum runtime: 28 days (to avoid novelty effect decay)
```

Always round up to full weeks to account for day-of-week effects. If runtime exceeds 4 weeks, revisit the MDE — you may be trying to detect an effect too small to matter.
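The runtime estimate and the full-week rounding rule reduce to a few lines of arithmetic; a sketch with the example figures:

```python
import math

daily_traffic = 4_200      # eligible users per day
sample_needed = 29_500     # total across both variants

days_to_sample = math.ceil(sample_needed / daily_traffic)  # days to reach sample size
full_weeks = math.ceil(days_to_sample / 7)                 # round up to whole weeks
runtime_days = max(full_weeks * 7, 14)                     # at least two weeks for seasonality
print(days_to_sample, runtime_days)  # 8 days raw, 14 days recommended
```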

### 6. Stopping Rules

Define in advance when and how the experiment ends:

- **Do not peek** at results before the planned sample size unless using a sequential testing framework (e.g., group sequential design, always-valid p-values)
- **Stop early for harm:** If a guardrail metric degrades beyond a pre-defined threshold (e.g., revenue drops > 5%), stop the experiment regardless of primary metric
- **No early stopping for success** under fixed-horizon testing — a significant p-value at 40% of the sample does not mean the effect is real
- **If using sequential testing:** Specify the spending function (O'Brien-Fleming, Pocock) and planned interim analysis points

### 7. Holdout Groups

When the experiment will lead to a permanent rollout, reserve a holdout:

- **Size:** 5-10% of eligible traffic, withheld from the winning variant after rollout
- **Purpose:** Measure long-term impact, detect novelty effects wearing off, validate the experiment result in production
- **Duration:** Minimum 4 weeks post-rollout, ideally one full business cycle

If no holdout is planned, document why (e.g., regulatory requirement to treat all users equally).

### 8. Analysis Plan

Write this before the experiment starts — never after seeing results:

1. **Primary analysis:** Compare variant vs. control on the primary metric using [test type]. Report the point estimate, 95% confidence interval, and p-value.
2. **Guardrail checks:** For each guardrail, confirm the metric did not degrade beyond the threshold. Use one-sided tests where appropriate.
3. **Segmentation:** Pre-register 2-3 subgroup analyses (e.g., new vs. returning users, mobile vs. desktop). Segments chosen after seeing results are exploratory, not confirmatory.
4. **Multiple comparisons:** If running more than two variants, apply Bonferroni or Holm correction. State the adjusted alpha.
5. **Sample Ratio Mismatch (SRM) check:** Verify the actual split matches the intended ratio (chi-square test, p < 0.001 threshold). SRM invalidates the experiment.
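The SRM check in step 5 is a one-line chi-square goodness-of-fit test; a sketch with scipy, using made-up assignment counts:

```python
from scipy.stats import chisquare

control_n, treatment_n = 15_012, 14_988            # realized assignment counts
expected = [(control_n + treatment_n) / 2] * 2     # intended 50/50 split

stat, p = chisquare([control_n, treatment_n], f_exp=expected)
if p < 0.001:   # threshold from the analysis plan
    print("SRM detected: fix assignment/logging before reading results")
else:
    print(f"No SRM detected (p = {p:.2f})")
```

Note the inverted logic versus a normal hypothesis test: here a *small* p-value is bad news, because it means the randomization itself is suspect.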

### 9. Reporting Template

```
Experiment: [Name]
Dates: [Start] – [End]
Variants: Control (50%) vs. Treatment (50%)
Total users: [N]

Primary metric: Checkout conversion
  Control: 3.18% (n = 14,800)
  Treatment: 3.71% (n = 14,700)
  Difference: +0.53pp (+16.7% relative)
  95% CI: [+0.12pp, +0.94pp]
  p-value: 0.011

Guardrails: All passed (see appendix)
SRM check: p = 0.42 (no mismatch)
Decision: SHIP / ITERATE / KILL
Rationale: [1-2 sentences]
```
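The template's statistics can be computed directly from raw counts. A sketch using a pooled two-proportion z-test and an unpooled Wald interval (standard choices, though not necessarily what any particular stats tool uses); the conversion counts are invented to roughly match the template's percentages:

```python
from math import sqrt
from scipy.stats import norm

conv_c, n_c = 471, 14_800    # control conversions / users (about 3.18%)
conv_t, n_t = 545, 14_700    # treatment conversions / users (about 3.71%)

p_c, p_t = conv_c / n_c, conv_t / n_t
diff = p_t - p_c                                         # point estimate

# 95% CI: unpooled (Wald) standard error
se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
lo, hi = diff - 1.96 * se, diff + 1.96 * se

# p-value: pooled two-proportion z-test
p_pool = (conv_c + conv_t) / (n_c + n_t)
z = diff / sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
p_value = 2 * norm.sf(abs(z))

print(f"Diff: {diff * 100:+.2f}pp, 95% CI [{lo * 100:+.2f}pp, {hi * 100:+.2f}pp], p = {p_value:.3f}")
```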

## Quality checklist

Before delivering an experiment design, verify:

- [ ] Hypothesis includes a falsifiable prediction and a causal mechanism
- [ ] Exactly one primary metric is defined — not a composite or a list
- [ ] Guardrail metrics cover revenue, performance, and user experience
- [ ] Sample size calculation includes all inputs (baseline, MDE, alpha, power)
- [ ] Randomization unit matches the unit of analysis (no user-level randomization with session-level metrics without correction)
- [ ] Runtime accounts for weekly seasonality (full-week increments)
- [ ] Stopping rules are defined before the experiment starts, not improvised mid-flight
- [ ] Analysis plan is pre-registered — subgroups and corrections specified in advance
- [ ] SRM check is included in the analysis plan

## Common mistakes to avoid

- **Peeking at results.** Checking significance daily and stopping when p < 0.05 inflates false positive rates to 20-30%. Use sequential testing if you need interim looks, or commit to a fixed horizon.
- **Underpowered tests.** Running an experiment "for a week" regardless of traffic. If you do not have enough sample to detect your MDE, the experiment will almost certainly show "no significant difference" — and you will learn nothing.
- **Multiple comparisons without correction.** Testing 5 variants against control at alpha = 0.05 gives a ~23% chance of at least one false positive. Apply Bonferroni (alpha / k) or use a hierarchical testing procedure.
- **Novelty effects.** New UI elements get extra attention simply for being new. If you measure a lift in week 1, it may vanish by week 3. Run experiments long enough and use holdout groups to validate durability.
- **Post-hoc segmentation as proof.** "It didn't win overall, but it won for mobile users in Germany" is not a valid conclusion — it is hypothesis generation. Pre-register segments or label post-hoc findings as exploratory.
- **Ignoring SRM.** If your 50/50 split is actually 51/49, something in the assignment or logging pipeline is broken. No amount of statistical analysis can fix corrupted randomization.
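The ~23% multiple-comparisons figure quoted above is easy to verify, treating the k variant-vs-control tests as independent (an approximation, since they share a control group):

```python
alpha, k = 0.05, 5

# Family-wise error rate with k uncorrected tests (independence assumed)
fwer = 1 - (1 - alpha) ** k
print(f"FWER with {k} uncorrected tests: {fwer:.1%}")  # ~22.6%

# Bonferroni keeps the family-wise rate at or below alpha
bonferroni_alpha = alpha / k
print(f"Bonferroni per-test alpha: {bonferroni_alpha:.3f}")
```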