Thesis: Most A/B programs don’t fail because the math is wrong — they fail because the design allows false positives to sneak in. Stop chasing p‑values and build a system: right‑size Minimum Detectable Effect (MDE), pre‑commit power, and enforce guardrails that make peeking and noise unprofitable.
1) What we’re optimizing for
- Type I error (α): Shipping a loser (false positive). Default 5% is fine — if you obey the rules.
- Type II error (β): Not shipping a winner (false negative). Power = 1−β. Typical targets: 80% or 90%.
- Business objective: Minimize long‑run regret (lost revenue + user harm) under traffic and time constraints.
Principle: Choose α and power before you see data, and size the test for a meaningful effect (MDE), not for whatever the data happens to show.
2) Minimum Detectable Effect (MDE)
Definition: The smallest effect you care to detect and act on. Set it by economics, not ego.
- Absolute MDE (Δ): Difference in rate (e.g., +0.3 pp on a 3.0% baseline).
- Relative MDE (r): Percent lift (e.g., +10% over 3.0% → 3.3%).
How to set MDE:
- Compute the break‑even lift that covers implementation cost and opportunity cost.
- Round up to a clean number; if the rounded MDE is larger than you like, you need more traffic/time or stronger variance reduction.
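As a sketch, the break-even computation might look like this in Python. All inputs here (traffic, value per conversion, cost) are hypothetical placeholders; plug in your own economics.

```python
# Break-even relative lift: the smallest lift whose incremental profit covers
# build + run cost over a chosen payback horizon. All numbers are hypothetical.

def break_even_rel_lift(baseline_rate, annual_visitors, value_per_conversion,
                        total_cost, horizon_years=1.0):
    """Smallest relative lift r whose extra conversions pay back total_cost."""
    baseline_conversions = baseline_rate * annual_visitors * horizon_years
    # Lift r adds r * baseline_conversions conversions, each worth value_per_conversion.
    return total_cost / (baseline_conversions * value_per_conversion)

r = break_even_rel_lift(baseline_rate=0.03, annual_visitors=1_000_000,
                        value_per_conversion=40.0, total_cost=30_000)
print(f"break-even relative lift: {r:.1%}")  # round this UP to set the MDE
```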
3) Power and sample size (two‑sample proportion, two‑sided α)
Let p0 be baseline rate and p1 = p0×(1+r). With α (e.g., 0.05) and desired power (e.g., 0.80):
Per‑variant sample size ≈
[ z(1−α/2)*√(2* p̄ *(1−p̄)) + z(power)*√(p0*(1−p0)+p1*(1−p1)) ]² / (p1 − p0)²
where p̄ = (p0 + p1)/2.
Quick MDE approximation (solve for Δ given n):
MDE_abs ≈ ( z(1−α/2) + z(power) ) * √( 2*p0*(1−p0)/n )
Good enough for planning; exact solves require iteration because p̄ depends on p1.
Reference table — per variant n (α=0.05 two‑sided, power=0.80)
| Baseline p0 | +10% rel | +20% rel | +30% rel | +50% rel |
|---|---|---|---|---|
| 1% | 163,095 | 42,693 | 19,826 | 7,750 |
| 2% | 80,681 | 21,108 | 9,797 | 3,825 |
| 3% | 53,210 | 13,914 | 6,454 | 2,517 |
| 5% | 31,233 | 8,158 | 3,780 | 1,470 |
| 10% | 14,751 | 3,841 | 1,774 | 686 |
| 20% | 6,509 | 1,682 | 771 | 293 |
Interpretation: At low baselines, small relative lifts demand big samples. If you can’t reach that n in a reasonable time, raise MDE, pick a higher‑impact surface, or apply variance reduction.
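The sizing formula above, as a small stdlib-only Python helper (`math.ceil` rounds up to whole users). It should reproduce the reference table within rounding:

```python
import math
from statistics import NormalDist

def n_per_arm(p0, rel, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided, two-sample proportions test."""
    z = NormalDist().inv_cdf
    p1 = p0 * (1 + rel)
    pbar = (p0 + p1) / 2
    za, zb = z(1 - alpha / 2), z(power)
    num = (za * math.sqrt(2 * pbar * (1 - pbar))
           + zb * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return math.ceil(num / (p1 - p0) ** 2)

print(n_per_arm(0.10, 0.10))  # 10% baseline, +10% rel: ~14,751 per arm
print(n_per_arm(0.05, 0.20))  # 5% baseline, +20% rel: ~8,158 per arm
```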
4) Variance reduction (get power without waiting months)
- Stratified randomization: Balance traffic on known covariates (device, geo, new vs returning).
- Covariate adjustment (CUPED/CUPAC): Use pre‑period or historical activity as a covariate in your estimator; expect 10–40% variance cuts.
- Ratio metrics: Normalize by exposures/visits when appropriate.
- Filtered windows: Exclude known unstable cohorts (e.g., day‑0 bots) with a pre‑registered rule.
- Stable OEC: Pick one Overall Evaluation Criterion (OEC) metric and treat all others as secondary.
5) Guardrails that stop false positives
Pre‑registration (one pager): hypothesis, primary metric, α, power, MDE, unit of randomization, traffic split, duration rule, stopping rule, and analysis plan.
Invariant metrics (must be equal across arms):
- Traffic allocation matches the planned split (no sample‑ratio mismatch, SRM)
- Eligibility/exposure counts
- Obvious invariants (gender/geo/device proportions)
Auto‑stop conditions:
- SRM check (Chi‑square p < 0.001) → stop and investigate.
- Performance guardrails (latency + errors) exceed thresholds.
- User harm guardrails (refunds/complaints) breach limits.
Multiple testing controls:
- Across variants or many metrics: control FWER (Holm–Bonferroni) or FDR (Benjamini–Hochberg) depending on appetite.
- Across many simultaneous experiments: hold program‑level α spending (stagger launches, or hierarchical modeling).
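Holm–Bonferroni is simple enough to sketch directly; the p-values in the example are made up:

```python
def holm_bonferroni(pvals, alpha=0.05):
    """Holm step-down FWER control: True = reject (significant), per p-value."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):  # threshold loosens as tests pass
            reject[i] = True
        else:
            break  # first failure: all larger p-values are kept too
    return reject

print(holm_bonferroni([0.010, 0.030, 0.004, 0.200]))  # [True, False, True, False]
```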
Peeking & early stopping:
- Don’t look unless your method supports it. For frequentist tests, use group‑sequential boundaries (Pocock/O’Brien‑Fleming) or alpha‑spending.
- Alternatively, use always‑valid sequential methods or Bayesian decision rules with pre‑set utility/costs.
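One concrete option for alpha-spending is the Lan–DeMets O'Brien–Fleming-type spending function, α(t) = 2·(1 − Φ(z₁₋α/₂ / √t)), where t is the information fraction. A sketch of the spending schedule (boundary computation at each interim is more involved and omitted here):

```python
from statistics import NormalDist

def obf_alpha_spent(t, alpha=0.05):
    """Cumulative two-sided alpha allowed by information fraction t (0 < t <= 1),
    O'Brien-Fleming-type spending: 2 * (1 - Phi(z_{1-alpha/2} / sqrt(t)))."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / t ** 0.5))

for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t={t:.2f}  cumulative alpha spent={obf_alpha_spent(t):.4f}")
# Spends almost nothing early; the full 0.05 is available only at the planned end.
```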
Rollouts:
- After a “win,” roll out to 50–90% of traffic and keep the remainder as a hold‑out for 1–2 weeks to confirm stability and catch novelty effects.
6) Design decisions that matter
- Unit of randomization: user > session > pageview (avoid cross‑over contamination).
- Exposure policy: users stick to the same arm for the full test.
- Traffic allocation: 50/50 for power unless you have strong reasons to skew.
- Duration: full‑week multiples to cover weekday/weekend cycles.
- Seasonality/novelty: avoid high‑variance periods for subtle effects; for big bets, run longer.
7) The analysis plan (frequentist example)
- Data QC: eligibility counts; arm sizes; SRM; metric distribution checks.
- Primary test: two‑sample z (proportions) or t/Welch (means); two‑sided α unless the decision is truly one‑sided.
- Effect size reporting: absolute Δ, relative %, and confidence interval.
- Decision rule: Ship only if CI lower bound ≥ 0 (or ≥ business MDE if you require practical significance).
- Sensitivity: Heterogeneity by pre‑registered segments only.
- Documentation: result + raw queries + QC screenshots.
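The primary test above, sketched for the proportions case; the counts in the usage line are hypothetical (3.0% → 3.3% on 100k users per arm, i.e. the +10% relative MDE case):

```python
from statistics import NormalDist

def two_prop_ztest(x_a, n_a, x_b, n_b, alpha=0.05):
    """Two-sided two-sample z-test for proportions.
    Pooled SE for the test statistic, unpooled SE for the CI (common practice)."""
    pa, pb = x_a / n_a, x_b / n_b
    delta = pb - pa
    p_pool = (x_a + x_b) / (n_a + n_b)
    se_test = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(delta) / se_test))
    se_ci = (pa * (1 - pa) / n_a + pb * (1 - pb) / n_b) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (delta - z_crit * se_ci, delta + z_crit * se_ci)
    return delta, ci, p_value

delta, ci, p = two_prop_ztest(3_000, 100_000, 3_300, 100_000)
print(f"delta={delta:+.4f}  95% CI=({ci[0]:.4f}, {ci[1]:.4f})  p={p:.5f}")
```

Apply the decision rule to the CI lower bound, not the point estimate: ship only if `ci[0]` clears zero (or your practical-significance bar).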
8) AA tests: when and how
Run AA occasionally to validate instrumentation and randomization.
- Expect ~5% “significant” by chance at α=0.05; don’t overreact.
- Use AA to calibrate SRM alarms and latency guardrails; don’t block the program waiting for a perfect AA.
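A quick simulation makes the ~5% expectation concrete (sample sizes and seed are arbitrary):

```python
import random
from statistics import NormalDist

def aa_false_positive_rate(n_tests=1000, n=2000, p=0.05, alpha=0.05, seed=1):
    """Simulate AA tests (both arms drawn from the same Bernoulli(p)) and count
    how often a two-sided z-test calls the difference 'significant'."""
    random.seed(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_tests):
        xa = sum(random.random() < p for _ in range(n))  # conversions, arm A
        xb = sum(random.random() < p for _ in range(n))  # conversions, arm B
        p_pool = (xa + xb) / (2 * n)
        se = (p_pool * (1 - p_pool) * (2 / n)) ** 0.5
        if se > 0 and abs(xa - xb) / n > z_crit * se:
            hits += 1
    return hits / n_tests

rate = aa_false_positive_rate()
print(f"AA 'significant' rate: {rate:.3f}")  # should hover near alpha = 0.05
```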
9) Common failure modes (and fixes)
- Under‑powered tests + peeking → inflated false positives. Fix: commit to the duration; use sequential corrections if you must peek.
- Metric creep: declaring victory on a secondary metric you didn’t pre‑register. Fix: OEC + hierarchy.
- Unit mismatch: randomize by pageview, evaluate by user. Fix: align unit or use cluster‑robust errors.
- Interference: users influence each other (e.g., referrals). Fix: cluster by source or switch surface.
- Non‑stationarity: promo calendar or algorithm updates mid‑test. Fix: pause or model with time fixed effects.
10) Practical templates
Pre‑registration (copy):
- Hypothesis: Changing X increases primary metric by ≥ r% for eligible population.
- Primary metric: e.g., SQL rate per eligible visitor (user‑level).
- α / Power / MDE: 0.05 / 0.80 / +10% relative.
- Unit: user; sticky assignment; 50/50 split.
- Duration: 3 full weeks; no peeking; one interim safety check for guardrails only.
- Stopping rule: stop only at planned end unless a guardrail triggers.
- Analysis: two‑sided z‑test; CI + practical sig ≥ MDE.
SRM check (2‑arm):
- Chi‑square with 1 df on counts A vs B; alert if p < 0.001.
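The 2-arm SRM check, sketched in Python (for 1 df, the chi-square tail probability equals 2·(1 − Φ(√χ²)), so no chi-square tables are needed; the counts are hypothetical):

```python
from statistics import NormalDist

def srm_check(n_a, n_b, expected_split=0.5, threshold=1e-3):
    """Chi-square (1 df) test of observed arm counts vs the planned split.
    For 1 df the tail probability is 2 * (1 - Phi(sqrt(chi2)))."""
    total = n_a + n_b
    e_a = total * expected_split
    e_b = total - e_a
    chi2 = (n_a - e_a) ** 2 / e_a + (n_b - e_b) ** 2 / e_b
    p = 2 * (1 - NormalDist().cdf(chi2 ** 0.5))
    return p, p < threshold  # (p-value, alert?)

print(srm_check(50_000, 50_600))  # mild imbalance: no alert
print(srm_check(50_000, 52_000))  # large imbalance: alert, stop and investigate
```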
Post‑win rollout:
- Roll out to 80% with a 20% hold‑out for 14 days; same OEC; must not degrade by more than X%.
11) Spreadsheet snippets (Google Sheets)
Inputs: p0 (baseline), rel (relative MDE), alpha, power, n (per arm)
Per‑arm sample size for proportions (two‑sided):
=LET(p0,B2, rel,B3, alpha,B4, pow,B5,
p1,p0*(1+rel), pbar,(p0+p1)/2,
za, NORM.S.INV(1-alpha/2), zb, NORM.S.INV(pow),
num,(za*SQRT(2*pbar*(1-pbar)) + zb*SQRT(p0*(1-p0)+p1*(1-p1)))^2,
den,(p1-p0)^2, num/den)
Quick MDE (absolute) given n:
=LET(p0,B2, alpha,B4, pow,B5, n,B6,
(NORM.S.INV(1-alpha/2)+NORM.S.INV(pow))*SQRT(2*p0*(1-p0)/n))
12) Narrative for stakeholders (copy‑ready)
Why we can trust this result: We pre‑registered the hypothesis, sized for a business‑meaningful MDE with 80% power, and ran for three full weeks without peeking. Allocation passed SRM. Guardrails stayed within thresholds. The 95% CI on the primary metric is [a,b], entirely above zero and above our MDE, so we expect similar lift at rollout. We’ll confirm with a 14‑day 80/20 hold‑out.
13) Program checklist (print this)
- Hypothesis and OEC agreed
- MDE and power set from economics
- Sample size computed and feasible
- Unit + allocation decided; sticky assignment in place
- Guardrails and SRM alarms wired
- Pre‑registration filed
- Start/stop dates booked (full weeks)
- QC dashboard ready
- Post‑win rollout plan drafted
14) If you can’t hit sample size
- Raise MDE to what is economically meaningful.
- Move the test to a higher‑traffic surface (or broader eligibility).
- Consolidate small tests into one bigger, causal change.
- Apply variance reduction (CUPED) or a better OEC.
- Consider switchback or time‑series designs for hard‑to‑randomize systems.
15) TL;DR
Pick an MDE you’d actually act on, size for 80–90% power, and enforce guardrails (SRM, invariant checks, alpha‑spending or no peeking, multiple‑testing control). That’s how you keep false positives out — and ship wins you can defend.