Thesis: Most A/B programs don’t fail because the math is wrong — they fail because the design allows false positives to sneak in. Stop chasing p‑values and build a system: right‑size Minimum Detectable Effect (MDE), pre‑commit power, and enforce guardrails that make peeking and noise unprofitable.
1) What we’re optimizing for
- Type I error (α): Shipping a loser (false positive). Default 5% is fine — if you obey the rules.
- Type II error (β): Not shipping a winner (false negative). Power = 1−β. Typical targets: 80% or 90%.
- Business objective: Minimize long‑run regret (lost revenue + user harm) under traffic and time constraints.
Principle: Choose α and power before you see data, and size the test for a meaningful effect (MDE), not for whatever the data happens to show.
2) Minimum Detectable Effect (MDE)
Definition: The smallest effect you care to detect and act on. Set it by economics, not ego.
- Absolute MDE (Δ): Difference in rate (e.g., +0.3 pp on a 3.0% baseline).
- Relative MDE (r): Percent lift (e.g., +10% over 3.0% → 3.3%).
How to set MDE:
- Compute the break‑even lift that covers implementation cost and opportunity cost.
- Round up to a clean number; if the rounded MDE is larger than you like, you need more traffic/time or stronger variance reduction.
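As a sketch, the break-even computation might look like this in Python. All inputs here (traffic, value per conversion, cost) are hypothetical placeholders; plug in your own economics.

```python
# Break-even relative lift: the smallest lift whose incremental profit covers
# build + run cost over a chosen payback horizon. All numbers are hypothetical.

def break_even_rel_lift(baseline_rate, annual_visitors, value_per_conversion,
                        total_cost, horizon_years=1.0):
    """Smallest relative lift r whose extra conversions pay back total_cost."""
    baseline_conversions = baseline_rate * annual_visitors * horizon_years
    # Lift r adds r * baseline_conversions conversions, each worth value_per_conversion.
    return total_cost / (baseline_conversions * value_per_conversion)

r = break_even_rel_lift(baseline_rate=0.03, annual_visitors=1_000_000,
                        value_per_conversion=40.0, total_cost=30_000)
print(f"break-even relative lift: {r:.1%}")  # round this UP to set the MDE
```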
3) Power and sample size (two‑sample proportion, two‑sided α)
Let p0 be baseline rate and p1 = p0×(1+r). With α (e.g., 0.05) and desired power (e.g., 0.80):
Per‑variant sample size ≈
[ z(1−α/2)*√(2* p̄ *(1−p̄)) + z(power)*√(p0*(1−p0)+p1*(1−p1)) ]² / (p1 − p0)²
where p̄ = (p0 + p1)/2.
Quick MDE approximation (solve for Δ given n):
MDE_abs ≈ ( z(1−α/2) + z(power) ) * √( 2*p0*(1−p0)/n )
Good enough for planning; exact solves require iteration because p̄ depends on p1.
Reference table — per variant n (α=0.05 two‑sided, power=0.80)
| Baseline p0 | +10% rel | +20% rel | +30% rel | +50% rel |
|---|---|---|---|---|
| 1% | 163,095 | 42,693 | 19,826 | 7,750 |
| 2% | 80,681 | 21,108 | 9,797 | 3,825 |
| 3% | 53,210 | 13,914 | 6,454 | 2,517 |
| 5% | 31,233 | 8,158 | 3,780 | 1,470 |
| 10% | 14,751 | 3,841 | 1,774 | 686 |
| 20% | 6,509 | 1,682 | 771 | 293 |
Interpretation: At low baselines, small relative lifts demand big samples. If you can’t reach that n in a reasonable time, raise MDE, pick a higher‑impact surface, or apply variance reduction.
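The sizing formula above, as a small stdlib-only Python helper (`math.ceil` rounds up to whole users). It should reproduce the reference table within rounding:

```python
import math
from statistics import NormalDist

def n_per_arm(p0, rel, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided, two-sample proportions test."""
    z = NormalDist().inv_cdf
    p1 = p0 * (1 + rel)
    pbar = (p0 + p1) / 2
    za, zb = z(1 - alpha / 2), z(power)
    num = (za * math.sqrt(2 * pbar * (1 - pbar))
           + zb * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return math.ceil(num / (p1 - p0) ** 2)

print(n_per_arm(0.10, 0.10))  # 10% baseline, +10% rel: ~14,751 per arm
print(n_per_arm(0.05, 0.20))  # 5% baseline, +20% rel: ~8,158 per arm
```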
4) Variance reduction (get power without waiting months)
- Stratified randomization: Balance traffic on known covariates (device, geo, new vs returning).
- Covariate adjustment (CUPED/CUPAC): Use pre‑period or historical activity as a covariate in your estimator; expect 10–40% variance cuts.
- Ratio metrics: Normalize by exposures/visits when appropriate.
- Filtered windows: Exclude known unstable cohorts (e.g., day‑0 bots) with a pre‑registered rule.
- Stable OEC: Pick one Overall Evaluation Criterion (OEC) metric and treat all others as secondary.
5) Guardrails that stop false positives
Pre‑registration (one pager): hypothesis, primary metric, α, power, MDE, unit of randomization, traffic split, duration rule, stopping rule, and analysis plan.
Invariant metrics (must be equal across arms):
- Traffic allocation matches the planned split (no sample‑ratio mismatch, SRM)
- Eligibility/exposure counts
- Obvious invariants (gender/geo/device proportions)
Auto‑stop conditions:
- SRM check (Chi‑square p < 0.001) → stop and investigate.
- Performance guardrails (latency + errors) exceed thresholds.
- User harm guardrails (refunds/complaints) breach limits.
Multiple testing controls:
- Across variants or many metrics: control FWER (Holm–Bonferroni) or FDR (Benjamini–Hochberg) depending on appetite.
- Across many simultaneous experiments: hold program‑level α spending (stagger launches, or hierarchical modeling).
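Holm–Bonferroni is simple enough to sketch directly; the p-values in the example are made up:

```python
def holm_bonferroni(pvals, alpha=0.05):
    """Holm step-down FWER control: True = reject (significant), per p-value."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):  # threshold loosens as tests pass
            reject[i] = True
        else:
            break  # first failure: all larger p-values are kept too
    return reject

print(holm_bonferroni([0.010, 0.030, 0.004, 0.200]))  # [True, False, True, False]
```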
Peeking & early stopping:
- Don’t look unless your method supports it. For frequentist tests, use group‑sequential boundaries (Pocock/O’Brien‑Fleming) or alpha‑spending.
- Alternatively, use always‑valid sequential methods or Bayesian decision rules with pre‑set utility/costs.
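One concrete option for alpha-spending is the Lan–DeMets O'Brien–Fleming-type spending function, α(t) = 2·(1 − Φ(z₁₋α/₂ / √t)), where t is the information fraction. A sketch of the spending schedule (boundary computation at each interim is more involved and omitted here):

```python
from statistics import NormalDist

def obf_alpha_spent(t, alpha=0.05):
    """Cumulative two-sided alpha allowed by information fraction t (0 < t <= 1),
    O'Brien-Fleming-type spending: 2 * (1 - Phi(z_{1-alpha/2} / sqrt(t)))."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / t ** 0.5))

for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t={t:.2f}  cumulative alpha spent={obf_alpha_spent(t):.4f}")
# Spends almost nothing early; the full 0.05 is available only at the planned end.
```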
Rollouts:
- After a “win,” roll out to 50–90% of traffic and keep the remainder as a hold‑out for 1–2 weeks to confirm stability and catch novelty effects.
6) Design decisions that matter
- Unit of randomization: user > session > pageview (avoid cross‑over contamination).
- Exposure policy: users stick to the same arm for the full test.
- Traffic allocation: 50/50 for power unless you have strong reasons to skew.
- Duration: full‑week multiples to cover weekday/weekend cycles.
- Seasonality/novelty: avoid high‑variance periods for subtle effects; for big bets, run longer.
7) The analysis plan (frequentist example)
- Data QC: eligibility counts; arm sizes; SRM; metric distribution checks.
- Primary test: two‑sample z (proportions) or t/Welch (means); two‑sided α unless the decision is truly one‑sided.
- Effect size reporting: absolute Δ, relative %, and confidence interval.
- Decision rule: Ship only if CI lower bound ≥ 0 (or ≥ business MDE if you require practical significance).
- Sensitivity: Heterogeneity by pre‑registered segments only.
- Documentation: result + raw queries + QC screenshots.
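The primary test above, sketched for the proportions case; the counts in the usage line are hypothetical (3.0% → 3.3% on 100k users per arm, i.e. the +10% relative MDE case):

```python
from statistics import NormalDist

def two_prop_ztest(x_a, n_a, x_b, n_b, alpha=0.05):
    """Two-sided two-sample z-test for proportions.
    Pooled SE for the test statistic, unpooled SE for the CI (common practice)."""
    pa, pb = x_a / n_a, x_b / n_b
    delta = pb - pa
    p_pool = (x_a + x_b) / (n_a + n_b)
    se_test = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(delta) / se_test))
    se_ci = (pa * (1 - pa) / n_a + pb * (1 - pb) / n_b) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (delta - z_crit * se_ci, delta + z_crit * se_ci)
    return delta, ci, p_value

delta, ci, p = two_prop_ztest(3_000, 100_000, 3_300, 100_000)
print(f"delta={delta:+.4f}  95% CI=({ci[0]:.4f}, {ci[1]:.4f})  p={p:.5f}")
```

Apply the decision rule to the CI lower bound, not the point estimate: ship only if `ci[0]` clears zero (or your practical-significance bar).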
8) AA tests: when and how
Run AA occasionally to validate instrumentation and randomization.
- Expect ~5% “significant” by chance at α=0.05; don’t overreact.
- Use AA to calibrate SRM alarms and latency guardrails; don’t block the program waiting for a perfect AA.
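A quick simulation makes the ~5% expectation concrete (sample sizes and seed are arbitrary):

```python
import random
from statistics import NormalDist

def aa_false_positive_rate(n_tests=1000, n=2000, p=0.05, alpha=0.05, seed=1):
    """Simulate AA tests (both arms drawn from the same Bernoulli(p)) and count
    how often a two-sided z-test calls the difference 'significant'."""
    random.seed(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_tests):
        xa = sum(random.random() < p for _ in range(n))  # conversions, arm A
        xb = sum(random.random() < p for _ in range(n))  # conversions, arm B
        p_pool = (xa + xb) / (2 * n)
        se = (p_pool * (1 - p_pool) * (2 / n)) ** 0.5
        if se > 0 and abs(xa - xb) / n > z_crit * se:
            hits += 1
    return hits / n_tests

rate = aa_false_positive_rate()
print(f"AA 'significant' rate: {rate:.3f}")  # should hover near alpha = 0.05
```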
9) Common failure modes (and fixes)
- Under‑powered tests + peeking → inflated false positives. Fix: commit to the duration; use sequential corrections if you must peek.
- Metric creep: declaring victory on a secondary metric you didn’t pre‑register. Fix: OEC + hierarchy.
- Unit mismatch: randomize by pageview, evaluate by user. Fix: align unit or use cluster‑robust errors.
- Interference: users influence each other (e.g., referrals). Fix: cluster by source or switch surface.
- Non‑stationarity: promo calendar or algorithm updates mid‑test. Fix: pause or model with time fixed effects.
10) Practical templates
Pre‑registration (copy):
- Hypothesis: Changing X increases primary metric by ≥ r% for eligible population.
- Primary metric: e.g., SQL rate per eligible visitor (user‑level).
- α / Power / MDE: 0.05 / 0.80 / +10% relative.
- Unit: user; sticky assignment; 50/50 split.
- Duration: 3 full weeks; no peeking; one interim safety check for guardrails only.
- Stopping rule: stop only at planned end unless a guardrail triggers.
- Analysis: two‑sided z‑test; CI + practical sig ≥ MDE.
SRM check (2‑arm):
- Chi‑square with 1 df on counts A vs B; alert if p < 0.001.
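The 2-arm SRM check, sketched in Python (for 1 df, the chi-square tail probability equals 2·(1 − Φ(√χ²)), so no chi-square tables are needed; the counts are hypothetical):

```python
from statistics import NormalDist

def srm_check(n_a, n_b, expected_split=0.5, threshold=1e-3):
    """Chi-square (1 df) test of observed arm counts vs the planned split.
    For 1 df the tail probability is 2 * (1 - Phi(sqrt(chi2)))."""
    total = n_a + n_b
    e_a = total * expected_split
    e_b = total - e_a
    chi2 = (n_a - e_a) ** 2 / e_a + (n_b - e_b) ** 2 / e_b
    p = 2 * (1 - NormalDist().cdf(chi2 ** 0.5))
    return p, p < threshold  # (p-value, alert?)

print(srm_check(50_000, 50_600))  # mild imbalance: no alert
print(srm_check(50_000, 52_000))  # large imbalance: alert, stop and investigate
```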
Post‑win rollout:
- Roll out to 80% with a 20% hold‑out for 14 days; same OEC; must not degrade by more than X%.
11) Spreadsheet snippets (Google Sheets)
Inputs: p0 (baseline), rel (relative MDE), alpha, power, n (per arm)
Per‑arm sample size for proportions (two‑sided):
=LET(p0,B2, rel,B3, alpha,B4, pow,B5,
p1,p0*(1+rel), pbar,(p0+p1)/2,
za, NORM.S.INV(1-alpha/2), zb, NORM.S.INV(pow),
num,(za*SQRT(2*pbar*(1-pbar)) + zb*SQRT(p0*(1-p0)+p1*(1-p1)))^2,
den,(p1-p0)^2, num/den)
Quick MDE (absolute) given n:
=LET(p0,B2, alpha,B4, pow,B5, n,B6,
(NORM.S.INV(1-alpha/2)+NORM.S.INV(pow))*SQRT(2*p0*(1-p0)/n))
12) Narrative for stakeholders (copy‑ready)
Why we can trust this result: We pre‑registered the hypothesis, sized for a business‑meaningful MDE with 80% power, and ran for three full weeks without peeking. Allocation passed SRM. Guardrails stayed within thresholds. The 95% CI on the primary metric is [a,b], entirely above zero and above our MDE, so we expect similar lift at rollout. We’ll confirm with a 14‑day 80/20 hold‑out.
13) Program checklist (print this)
- Hypothesis and OEC agreed
- MDE and power set from economics
- Sample size computed and feasible
- Unit + allocation decided; sticky assignment in place
- Guardrails and SRM alarms wired
- Pre‑registration filed
- Start/stop dates booked (full weeks)
- QC dashboard ready
- Post‑win rollout plan drafted
14) If you can’t hit sample size
- Raise MDE to what is economically meaningful.
- Move the test to a higher‑traffic surface (or broader eligibility).
- Consolidate small tests into one bigger, causal change.
- Apply variance reduction (CUPED) or a better OEC.
- Consider switchback or time‑series designs for hard‑to‑randomize systems.
15) TL;DR
Pick an MDE you’d actually act on, size for 80–90% power, and enforce guardrails (SRM, invariant checks, alpha‑spending or no peeking, multiple‑testing control). That’s how you keep false positives out — and ship wins you can defend.