Experiment Design
Design rigorous A/B tests and product experiments — defining hypotheses, choosing metrics, calculating sample sizes, setting stopping rules, and writing analysis plans that avoid common statistical pitfalls.
onboarding-flow-test.md
# Experiment Design: Guided Onboarding Flow

## Hypothesis

If we replace the current self-serve onboarding (5-step form) with an interactive guided tour that demonstrates core features in context, then the 14-day activation rate will increase by at least 5 percentage points, because users who experience value during onboarding are more likely to complete setup and return within the first two weeks.

## Metrics

| Role | Metric | Baseline | MDE | Direction |
|------|--------|----------|-----|-----------|
| Primary | 14-day activation rate (user completes 2+ core actions within 14 days of signup) | 32% | +5pp | Increase |
| Guardrail | Onboarding completion rate | 68% | -5pp | Must not decrease |
| Guardrail | Support ticket rate (first 14 days) | 3.1% | +1pp | Must not increase |
| Guardrail | Time-to-complete onboarding | 4.2 min | +2 min | Must not increase |

## Sample Size Calculation

```
Baseline rate: 32%
Minimum detectable effect: +5pp (absolute) -> 37%
Significance level (alpha): 0.05 (two-sided)
Power (1 - beta): 0.80
Sample size per variant: ~1,530 users
Total sample: ~3,060 users
Tool: Evan Miller sample size calculator (two-proportion z-test)
```
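The figure above can be sanity-checked with the standard normal-approximation formula, using only the Python standard library. The function name below is ours, not from this document; note that this pooled-variance version yields 1,418 per variant, while calculators that add a continuity correction or use unpooled variance report somewhat larger numbers, so the document's ~1,530 is a conservative planning figure.

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_proportions(p1: float, p2: float,
                                alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-variant n for a two-sided two-proportion z-test
    (normal approximation, pooled variance under H0)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # 0.84 for power = 0.80
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

n = sample_size_two_proportions(0.32, 0.37)
print(n)  # 1418 with this formula; corrected variants land closer to ~1,530
```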

## Randomization

- **Unit**: User-level (new signups only)
- **Method**: Hash-based assignment on user ID (deterministic)
- **Stickiness**: User ID hash persists across sessions; variant stored in `experiments.assignments` table
- **Exclusions**: Users on Enterprise plans (they receive white-glove onboarding)

## Runtime Estimation

```
Daily eligible signups: ~180 users
Sample needed: 3,060 users
Estimated runtime: 17 days (to reach sample size)
Recommended minimum: 21 days (3 full weeks for day-of-week coverage)
Maximum runtime: 35 days (cap to avoid novelty decay)
```
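The runtime numbers are straightforward arithmetic; a minimal sketch:

```python
from math import ceil

daily_signups = 180        # eligible new signups per day (estimate above)
total_needed = 3060        # both variants combined
days_to_sample = ceil(total_needed / daily_signups)
print(days_to_sample)      # 17

# Round up to full weeks so every day of the week is covered equally
runtime_days = max(days_to_sample, 21)
```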

## Stopping Rules

- **No peeking**: Fixed-horizon design. Do not evaluate primary metric significance until day 21.
- **Stop for harm**: If support ticket rate exceeds 5% (rolling 3-day average) in the treatment group, halt the experiment and revert all users to control.
- **No early stopping for success**: Even if results look significant at day 10, run to the planned horizon.

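The stop-for-harm rule can be automated with a small check; this sketch assumes one treatment-group ticket rate per day, oldest first (the function name is ours).

```python
def should_stop_for_harm(daily_ticket_rates: list[float],
                         threshold: float = 0.05, window: int = 3) -> bool:
    """Halt if the rolling `window`-day mean ticket rate in the treatment
    group exceeds `threshold` (5% per the stopping rule above)."""
    if len(daily_ticket_rates) < window:
        return False          # not enough days for a rolling average yet
    recent = daily_ticket_rates[-window:]
    return sum(recent) / window > threshold

assert should_stop_for_harm([0.03, 0.04, 0.05]) is False  # mean 4.0%
assert should_stop_for_harm([0.03, 0.06, 0.07]) is True   # mean ~5.3%
```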
## Holdout Group

- **Size**: 5% of eligible traffic withheld from the winning variant after rollout
- **Duration**: 6 weeks post-rollout
- **Purpose**: Confirm the activation lift is durable and not a novelty effect

## Analysis Plan

1. **Primary analysis**: Two-proportion z-test on 14-day activation rate. Report point estimate, 95% CI, and p-value.
2. **Guardrail checks**: One-sided tests confirming no degradation beyond thresholds.
3. **Pre-registered segments**:
   - Free vs. Pro plan signups
   - Mobile vs. desktop first session
4. **SRM check**: Chi-square test on variant assignment ratio. Threshold: p < 0.001 indicates a pipeline issue; invalidate results.
5. **Post-hoc** (labeled exploratory): Funnel analysis on which guided tour steps have the highest drop-off.

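Steps 1 and 4 can be sketched with stdlib-only functions (names ours; real analyses would typically reach for scipy or statsmodels). The counts below are illustrative inputs at the planned sample size, not experiment data.

```python
from statistics import NormalDist

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int):
    """Two-sided z-test for p2 - p1 (pooled variance under H0).
    Returns (lift point estimate, z statistic, p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p2 - p1, z, p_value

def srm_pvalue(n_control: int, n_treatment: int) -> float:
    """Chi-square goodness-of-fit against the designed 50/50 split.
    With 1 df, chi2 = z^2, so p = 2 * (1 - Phi(sqrt(chi2)))."""
    total = n_control + n_treatment
    expected = total / 2
    chi2 = ((n_control - expected) ** 2 / expected
            + (n_treatment - expected) ** 2 / expected)
    return 2 * (1 - NormalDist().cdf(chi2 ** 0.5))

# Illustrative: ~32% vs ~37% activation at n = 1,530 per variant
lift, z, p = two_proportion_ztest(490, 1530, 566, 1530)
print(round(lift, 3), round(p, 4))  # lift ≈ 0.05, p ≈ 0.004
```

Per step 4, a `srm_pvalue` below 0.001 would mean the assignment pipeline is broken and the results should be discarded regardless of what the primary test shows.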
## Decision Framework

| Outcome | Action |
|---------|--------|
| Primary wins, guardrails pass | Ship to 100%, maintain holdout |
| Primary wins, guardrail fails | Investigate the degraded guardrail, iterate on treatment |
| Primary neutral | Kill the experiment, extract learnings from funnel analysis |
| Primary loses | Kill the experiment, post-mortem on hypothesis |