Experiment Design
Design rigorous A/B tests and product experiments — defining hypotheses, choosing metrics, calculating sample sizes, setting stopping rules, and writing analysis plans that avoid common statistical pitfalls.
Tags: a-b-testing, experiments, hypothesis, sample-size, statistics, product-analytics
$ npx skills add The-AI-Directory-Company/(…) --skill experiment-design
SKILL.md
# Experiment Design

## Before you start

Gather the following from the user. If anything is missing, ask before proceeding:

1. **What change are you testing?** (UI change, algorithm tweak, pricing model, new feature rollout)
2. **What outcome do you expect?** (Increase conversion, reduce churn, improve engagement — be specific)
3. **Who is the target population?** (All users, a segment, new users only, specific market)
4. **What is the current baseline?** (Current conversion rate, average revenue, retention rate — with approximate numbers)
5. **What is the minimum detectable effect (MDE)?** (Smallest improvement worth detecting — e.g., +2pp conversion, +5% revenue)
6. **What is the timeline?** (How long can the experiment run before a decision is needed?)
7. **Are there any constraints?** (Traffic volume, seasonality, regulatory requirements, shared infrastructure)

## Experiment design template

### 1. Hypothesis

State a falsifiable hypothesis in this format:

```
If we [change], then [metric] will [direction] by at least [MDE],
because [reasoning based on user behavior or data].
```

A hypothesis without a mechanism ("because") is a guess. The mechanism forces you to articulate why the change should work, which informs metric selection and interpretation.

### 2. Primary Metric + Guardrail Metrics

**Primary metric:** One metric that decides the experiment. Exactly one — not two, not "primary and secondary." If you cannot pick one, you do not understand the goal yet.

**Guardrail metrics:** 2-4 metrics that must not degrade. These protect against winning on the primary metric at the cost of something else.

| Role | Metric | Current Baseline | MDE | Direction |
|------|--------|-----------------|-----|-----------|
| Primary | Checkout conversion rate | 3.2% | +0.5pp | Increase |
| Guardrail | Revenue per user | $12.40 | -$0.50 | Must not decrease |
| Guardrail | Page load time (p95) | 1.8s | +200ms | Must not increase |
| Guardrail | Support ticket rate | 0.4% | +0.1pp | Must not increase |

### 3. Sample Size Calculation

Specify the inputs and the result:

```
Baseline rate: 3.2%
Minimum detectable effect: +0.5pp (absolute) → 3.7%
Significance level (alpha): 0.05 (two-sided)
Power (1 - beta): 0.80
Sample size per variant: ~20,900 users
Total sample: ~41,800 users
```

State the formula or tool used (e.g., Evan Miller's calculator, statsmodels power analysis). If using a ratio metric or non-binomial outcome, note the test type (t-test, Mann-Whitney, etc.).
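
The inputs above plug into the standard unpooled two-proportion formula. A minimal sketch, assuming scipy is available (the function name is illustrative):

```python
import math

from scipy.stats import norm


def two_proportion_sample_size(baseline, mde_abs, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided two-proportion z-test
    (unpooled-variance normal approximation)."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05, two-sided
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


n = two_proportion_sample_size(0.032, 0.005)
print(f"per variant: {n:,}, total: {2 * n:,}")  # ~20,900 per variant
```

Results vary by a few percent across tools (pooled vs. unpooled variance, continuity corrections), so treat the output as a planning estimate, not a contract.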

### 4. Randomization Unit

Define what gets randomized and how:

- **Unit:** User-level (most common), session-level, device-level, or cluster-level
- **Method:** Hash-based assignment (deterministic) vs. random draw (non-deterministic)
- **Stickiness:** Users must stay in the same variant across sessions. Specify how (user ID hash, cookie, backend assignment table)

Flag risks: if randomizing at user level but the feature affects shared resources (e.g., marketplace supply), consider cluster or switchback designs.
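
Deterministic, sticky assignment can be as simple as hashing the user ID with an experiment-specific salt. A sketch (names are illustrative):

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministic hash-based assignment: the same user always lands
    in the same variant, with no assignment table to maintain."""
    # Salting with the experiment name decorrelates the splits of
    # different experiments running on the same user base.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# Stickiness: repeated calls agree across sessions and servers.
assert assign_variant("user-42", "checkout-v2") == assign_variant("user-42", "checkout-v2")
```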

### 5. Runtime Estimation

```
Daily eligible traffic: ~4,200 users
Sample needed: 41,800 users
Estimated runtime: 10 days (to reach sample size)
Recommended minimum: 14 days (to capture weekly seasonality)
Maximum runtime: 28 days (to avoid novelty effect decay)
```

Always round up to full weeks to account for day-of-week effects. If runtime exceeds 4 weeks, revisit the MDE — you may be trying to detect an effect too small to matter.
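
The runtime arithmetic, including the round-up-to-full-weeks rule, as a minimal sketch (numbers match the example above):

```python
import math


def runtime_days(total_sample: int, daily_traffic: int) -> int:
    """Days to reach the target sample, rounded up to full weeks so
    every day of the week is represented equally."""
    days = math.ceil(total_sample / daily_traffic)
    return math.ceil(days / 7) * 7


print(runtime_days(41_800, 4_200))  # 10 raw days -> 14 days
```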

### 6. Stopping Rules

Define in advance when and how the experiment ends:

- **Do not peek** at results before the planned sample size unless using a sequential testing framework (e.g., group sequential design, always-valid p-values)
- **Stop early for harm:** If a guardrail metric degrades beyond a pre-defined threshold (e.g., revenue drops > 5%), stop the experiment regardless of primary metric
- **No early stopping for success** under fixed-horizon testing — a significant p-value at 40% of the sample does not mean the effect is real
- **If using sequential testing:** Specify the spending function (O'Brien-Fleming, Pocock) and planned interim analysis points
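
The "do not peek" rule is easy to verify by simulation: run A/A tests (no true effect) and compare the false positive rate of a single fixed-horizon look against checking every day. A sketch assuming numpy; the simulation parameters are illustrative:

```python
import numpy as np

# A/A simulation: both arms share the same true conversion rate, so any
# "significant" result is a false positive by construction.
rng = np.random.default_rng(7)
n_sims, n_days, daily_n, rate = 2000, 20, 500, 0.05

a = rng.binomial(daily_n, rate, size=(n_sims, n_days)).cumsum(axis=1)
b = rng.binomial(daily_n, rate, size=(n_sims, n_days)).cumsum(axis=1)
n = daily_n * np.arange(1, n_days + 1)         # cumulative sample per arm
pooled = (a + b) / (2 * n)
se = np.sqrt(pooled * (1 - pooled) * 2 / n)    # pooled two-proportion SE
z = (a / n - b / n) / se

fixed_horizon = np.mean(np.abs(z[:, -1]) > 1.96)         # one look, at the end
daily_peeking = np.mean((np.abs(z) > 1.96).any(axis=1))  # stop at first "hit"
print(f"fixed horizon: {fixed_horizon:.3f}")  # close to the nominal 0.05
print(f"daily peeking: {daily_peeking:.3f}")  # roughly 0.2-0.3
```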

### 7. Holdout Groups

When the experiment will lead to a permanent rollout, reserve a holdout:

- **Size:** 5-10% of eligible traffic, withheld from the winning variant after rollout
- **Purpose:** Measure long-term impact, detect novelty effects wearing off, validate the experiment result in production
- **Duration:** Minimum 4 weeks post-rollout, ideally one full business cycle

If no holdout is planned, document why (e.g., regulatory requirement to treat all users equally).

### 8. Analysis Plan

Write this before the experiment starts — never after seeing results:

1. **Primary analysis:** Compare variant vs. control on the primary metric using [test type]. Report the point estimate, 95% confidence interval, and p-value.
2. **Guardrail checks:** For each guardrail, confirm the metric did not degrade beyond the threshold. Use one-sided tests where appropriate.
3. **Segmentation:** Pre-register 2-3 subgroup analyses (e.g., new vs. returning users, mobile vs. desktop). Segments chosen after seeing results are exploratory, not confirmatory.
4. **Multiple comparisons:** If running more than two variants, apply Bonferroni or Holm correction. State the adjusted alpha.
5. **Sample Ratio Mismatch (SRM) check:** Verify the actual split matches the intended ratio (chi-square test, p < 0.001 threshold). SRM invalidates the experiment.
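
The SRM check in step 5 is a one-line chi-square goodness-of-fit test. A sketch assuming scipy; the counts are illustrative:

```python
from scipy.stats import chisquare


def srm_pvalue(observed_counts, intended_ratio):
    """Chi-square goodness-of-fit p-value for Sample Ratio Mismatch.
    p < 0.001 means the split is broken and the experiment is invalid."""
    total = sum(observed_counts)
    expected = [r * total for r in intended_ratio]
    return chisquare(observed_counts, f_exp=expected).pvalue


print(srm_pvalue([10_078, 9_922], [0.5, 0.5]))  # ~0.27: no mismatch
print(srm_pvalue([10_300, 9_700], [0.5, 0.5]))  # ~2e-5: stop and debug
```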

### 9. Reporting Template

```
Experiment: [Name]
Dates: [Start] – [End]
Variants: Control (50%) vs. Treatment (50%)
Total users: [N]

Primary metric: Checkout conversion
Control: 3.18% (n = 20,950)
Treatment: 3.71% (n = 20,850)
Difference: +0.53pp (+16.7% relative)
95% CI: [+0.18pp, +0.88pp]
p-value: 0.003

Guardrails: All passed (see appendix)
SRM check: p = 0.62 (no mismatch)
Decision: SHIP / ITERATE / KILL
Rationale: [1-2 sentences]
```
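
The point estimate, CI, and p-value in a report like this come from a two-proportion z-test. A sketch of the computation, assuming scipy (unpooled Wald interval; the counts are illustrative):

```python
from scipy.stats import norm


def two_proportion_test(x_c, n_c, x_t, n_t, conf=0.95):
    """Difference in proportions with a Wald CI and two-sided p-value."""
    p_c, p_t = x_c / n_c, x_t / n_t
    diff = p_t - p_c
    se = (p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t) ** 0.5
    p_value = 2 * norm.sf(abs(diff / se))
    half = norm.ppf(0.5 + conf / 2) * se
    return diff, (diff - half, diff + half), p_value


diff, ci, p = two_proportion_test(x_c=666, n_c=20_950, x_t=773, n_t=20_850)
print(f"+{diff * 100:.2f}pp, 95% CI [{ci[0] * 100:+.2f}pp, "
      f"{ci[1] * 100:+.2f}pp], p = {p:.3f}")
# -> +0.53pp, 95% CI [+0.18pp, +0.88pp], p = 0.003
```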

## Quality checklist

Before delivering an experiment design, verify:

- [ ] Hypothesis includes a falsifiable prediction and a causal mechanism
- [ ] Exactly one primary metric is defined — not a composite or a list
- [ ] Guardrail metrics cover revenue, performance, and user experience
- [ ] Sample size calculation includes all inputs (baseline, MDE, alpha, power)
- [ ] Randomization unit matches the unit of analysis (no user-level randomization with session-level metrics without correction)
- [ ] Runtime accounts for weekly seasonality (full-week increments)
- [ ] Stopping rules are defined before the experiment starts, not improvised mid-flight
- [ ] Analysis plan is pre-registered — subgroups and corrections specified in advance
- [ ] SRM check is included in the analysis plan

## Common mistakes to avoid

- **Peeking at results.** Checking significance daily and stopping when p < 0.05 inflates false positive rates to 20-30%. Use sequential testing if you need interim looks, or commit to a fixed horizon.
- **Underpowered tests.** Running an experiment "for a week" regardless of traffic. If you do not have enough sample to detect your MDE, the experiment will almost certainly show "no significant difference" — and you will learn nothing.
- **Multiple comparisons without correction.** Testing 5 variants against control at alpha = 0.05 gives a ~23% chance of at least one false positive. Apply Bonferroni (alpha / k) or use a hierarchical testing procedure.
- **Novelty effects.** New UI elements get extra attention simply for being new. If you measure a lift in week 1, it may vanish by week 3. Run experiments long enough and use holdout groups to validate durability.
- **Post-hoc segmentation as proof.** "It didn't win overall, but it won for mobile users in Germany" is not a valid conclusion — it is hypothesis generation. Pre-register segments or label post-hoc findings as exploratory.
- **Ignoring SRM.** If your 50/50 split is actually 51/49, something in the assignment or logging pipeline is broken. No amount of statistical analysis can fix corrupted randomization.
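
The ~23% figure in the multiple-comparisons bullet is just 1 − 0.95⁵. A sketch of that arithmetic plus a plain Holm step-down correction (the p-values are illustrative):

```python
# Familywise error rate: k independent tests, each at alpha = 0.05.
k = 5
print(f"{1 - (1 - 0.05) ** k:.1%}")  # 22.6% chance of >= 1 false positive


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: uniformly more powerful than plain Bonferroni,
    still controls the familywise error rate at alpha."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    reject = [False] * len(p_values)
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (len(p_values) - rank):
            break  # once one test fails, every larger p-value fails too
        reject[i] = True
    return reject


print(holm_reject([0.003, 0.04, 0.02, 0.30, 0.011]))
# -> [True, False, False, False, True]
```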
| 151 |