Experiment Design
Design rigorous A/B tests and product experiments — defining hypotheses, choosing metrics, calculating sample sizes, setting stopping rules, and writing analysis plans that avoid common statistical pitfalls.
onboarding-flow-test.md
# Experiment Design: Guided Onboarding Flow

## Hypothesis

If we replace the current self-serve onboarding (5-step form) with an interactive guided tour that demonstrates core features in context, then the 14-day activation rate will increase by at least 5 percentage points, because users who experience value during onboarding are more likely to complete setup and return within the first two weeks.

## Metrics

| Role | Metric | Baseline | MDE | Direction |
|------|--------|----------|-----|-----------|
| Primary | 14-day activation rate (user completes 2+ core actions within 14 days of signup) | 32% | +5pp | Increase |
| Guardrail | Onboarding completion rate | 68% | -5pp | Must not decrease |
| Guardrail | Support ticket rate (first 14 days) | 3.1% | +1pp | Must not increase |
| Guardrail | Time-to-complete onboarding | 4.2 min | +2 min | Must not increase |

## Sample Size Calculation

```
Baseline rate: 32%
Minimum detectable effect: +5pp (absolute) -> 37%
Significance level (alpha): 0.05 (two-sided)
Power (1 - beta): 0.80
Sample size per variant: ~1,530 users
Total sample: ~3,060 users
Tool: Evan Miller sample size calculator (two-proportion z-test)
```
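The figure above can be sanity-checked with the standard normal-approximation formula, using only the Python standard library. The function name below is ours, not from this document; note that this pooled-variance version yields 1,418 per variant, while calculators that add a continuity correction or use unpooled variance report somewhat larger numbers, so the document's ~1,530 is a conservative planning figure.

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_proportions(p1: float, p2: float,
                                alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-variant n for a two-sided two-proportion z-test
    (normal approximation, pooled variance under H0)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # 0.84 for power = 0.80
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

n = sample_size_two_proportions(0.32, 0.37)
print(n)  # 1418 with this formula; corrected variants land closer to ~1,530
```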

## Randomization

- **Unit**: User-level (new signups only)
- **Method**: Hash-based assignment on user ID (deterministic)
- **Stickiness**: User ID hash persists across sessions; variant stored in `experiments.assignments` table
- **Exclusions**: Users on Enterprise plans (they receive white-glove onboarding)

## Runtime Estimation

```
Daily eligible signups: ~180 users
Sample needed: 3,060 users
Estimated runtime: 17 days (to reach sample size)
Recommended minimum: 21 days (3 full weeks for day-of-week coverage)
Maximum runtime: 35 days (cap to avoid novelty decay)
```
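The runtime numbers are straightforward arithmetic; a minimal sketch:

```python
from math import ceil

daily_signups = 180        # eligible new signups per day (estimate above)
total_needed = 3060        # both variants combined
days_to_sample = ceil(total_needed / daily_signups)
print(days_to_sample)      # 17

# Round up to full weeks so every day of the week is covered equally
runtime_days = max(days_to_sample, 21)
```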

## Stopping Rules

- **No peeking**: Fixed-horizon design. Do not evaluate primary metric significance until day 21.
- **Stop for harm**: If support ticket rate exceeds 5% (rolling 3-day average) in the treatment group, halt the experiment and revert all users to control.
- **No early stopping for success**: Even if results look significant at day 10, run to the planned horizon.

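The stop-for-harm rule can be automated with a small check; this sketch assumes one treatment-group ticket rate per day, oldest first (the function name is ours).

```python
def should_stop_for_harm(daily_ticket_rates: list[float],
                         threshold: float = 0.05, window: int = 3) -> bool:
    """Halt if the rolling `window`-day mean ticket rate in the treatment
    group exceeds `threshold` (5% per the stopping rule above)."""
    if len(daily_ticket_rates) < window:
        return False          # not enough days for a rolling average yet
    recent = daily_ticket_rates[-window:]
    return sum(recent) / window > threshold

assert should_stop_for_harm([0.03, 0.04, 0.05]) is False  # mean 4.0%
assert should_stop_for_harm([0.03, 0.06, 0.07]) is True   # mean ~5.3%
```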
## Holdout Group

- **Size**: 5% of eligible traffic withheld from the winning variant after rollout
- **Duration**: 6 weeks post-rollout
- **Purpose**: Confirm the activation lift is durable and not a novelty effect

## Analysis Plan

1. **Primary analysis**: Two-proportion z-test on 14-day activation rate. Report point estimate, 95% CI, and p-value.
2. **Guardrail checks**: One-sided tests confirming no degradation beyond thresholds.
3. **Pre-registered segments**:
   - Free vs. Pro plan signups
   - Mobile vs. desktop first session
4. **SRM check**: Chi-square test on variant assignment ratio. Threshold: p < 0.001 indicates a pipeline issue; invalidate results.
5. **Post-hoc** (labeled exploratory): Funnel analysis on which guided tour steps have the highest drop-off.

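Steps 1 and 4 can be sketched with stdlib-only functions (names ours; real analyses would typically reach for scipy or statsmodels). The counts below are illustrative inputs at the planned sample size, not experiment data.

```python
from statistics import NormalDist

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int):
    """Two-sided z-test for p2 - p1 (pooled variance under H0).
    Returns (lift point estimate, z statistic, p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p2 - p1, z, p_value

def srm_pvalue(n_control: int, n_treatment: int) -> float:
    """Chi-square goodness-of-fit against the designed 50/50 split.
    With 1 df, chi2 = z^2, so p = 2 * (1 - Phi(sqrt(chi2)))."""
    total = n_control + n_treatment
    expected = total / 2
    chi2 = ((n_control - expected) ** 2 / expected
            + (n_treatment - expected) ** 2 / expected)
    return 2 * (1 - NormalDist().cdf(chi2 ** 0.5))

# Illustrative: ~32% vs ~37% activation at n = 1,530 per variant
lift, z, p = two_proportion_ztest(490, 1530, 566, 1530)
print(round(lift, 3), round(p, 4))  # lift ≈ 0.05, p ≈ 0.004
```

Per step 4, a `srm_pvalue` below 0.001 would mean the assignment pipeline is broken and the results should be discarded regardless of what the primary test shows.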
## Decision Framework

| Outcome | Action |
|---------|--------|
| Primary wins, guardrails pass | Ship to 100%, maintain holdout |
| Primary wins, guardrail fails | Investigate the degraded guardrail, iterate on treatment |
| Primary neutral | Kill the experiment, extract learnings from funnel analysis |
| Primary loses | Kill the experiment, post-mortem on hypothesis |