Thesis: When attribution gets noisy, experiments are how you buy clarity. For channels that can’t randomize at the user level (TV/CTV/OOH, upper‑funnel social, brand search), the right tool is holdouts, geo‑splits, and switchbacks—run with enough power, clean guardrails, and an analysis plan you can defend.
1) What each design is good for
| Design | Unit | Best for | Pros | Watch‑outs |
|---|---|---|---|---|
| Holdout (user or audience) | Users/Audience/Brand terms | Email/CRM, retargeting, brand search cannibalization | Simple, quick reads; low ops burden | Leakage between lists; platform modeling can muddy counts |
| Geo‑split | Markets/regions (DMAs, countries) | Meta upper‑funnel, YouTube/CTV/OOH | Causal, close to business reality | Needs matched markets; spillovers across borders; requires scale |
| Switchback | Same market toggled on/off by period | Operational changes, pricing, app‑wide promos | Avoids cross‑market bias; needs fewer markets | Requires stationarity; sensitive to seasonality and events |
Rule of thumb: If you can randomize users → do that. If not, choose geo‑split when you can isolate markets; choose switchback when you can toggle reliably and markets are scarce.
2) Pre‑registration (one page)
- Hypothesis: Turning ON channel X increases primary metric by ≥ r% for eligible population.
- Design: holdout / geo‑split / switchback; unit; assignment method.
- Windows: lookback/exposure, conversion lag, blackout periods.
- α / Power / MDE: 0.05 / 0.80–0.90 / r%.
- Guardrails: SRM, latency, error rates, budget caps, brand safety.
- Stopping rule: run full duration unless safety guardrail trips.
- Analysis plan: estimator, covariates, cluster‑robust errors, decision rule.
- Rollout plan: post‑win holdout (10–20%) for 1–2 weeks.
3) Sizing for Power (fast, practical math)
A) User‑level holdouts (proportions)
Let baseline p0, target relative lift r, α (two‑sided), power. Approx per‑arm n:
n ≈ [ z(1−α/2)*√(2* p̄ *(1−p̄)) + z(power)*√(p0*(1−p0)+p1*(1−p1)) ]² / (p1 − p0)²
Where p1 = p0*(1+r), p̄ = (p0+p1)/2
Quick MDE given n:
MDE_abs ≈ ( z(1−α/2) + z(power) ) * √( 2*p0*(1−p0)/n )
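The two formulas above translate directly into a short calculator. A minimal sketch in Python (function names are my own; stdlib only):

```python
import math
from statistics import NormalDist

def n_per_arm(p0: float, rel_lift: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm n for a two-proportion test (normal approximation)."""
    p1 = p0 * (1 + rel_lift)
    pbar = (p0 + p1) / 2
    za = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    zb = NormalDist().inv_cdf(power)
    num = (za * (2 * pbar * (1 - pbar)) ** 0.5
           + zb * (p0 * (1 - p0) + p1 * (1 - p1)) ** 0.5) ** 2
    return math.ceil(num / (p1 - p0) ** 2)

def mde_abs(p0: float, n: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Quick absolute MDE given per-arm n."""
    za = NormalDist().inv_cdf(1 - alpha / 2)
    zb = NormalDist().inv_cdf(power)
    return (za + zb) * (2 * p0 * (1 - p0) / n) ** 0.5
```

For example, a 5% baseline conversion rate and a 10% relative lift target need roughly 31k users per arm at 0.05/0.80, which is why user-level holdouts only pay off on high-traffic surfaces.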
B) Clustered designs (geo & switchback)
Variance inflates by the design effect (DE):
DE = 1 + (m − 1)*ICC
where m is the average number of users per cluster‑period and ICC is the intra‑cluster correlation (start with 0.01–0.05 for sales‑like metrics). Use the effective sample size n_eff = n_users / DE, or size directly on cluster‑periods:
SE(Δ) ≈ σ_g * √( 1/G_t + 1/G_c ) ≈ 2*σ_g / √(G) (difference‑in‑means on geo‑week averages, even split)
MDE ≈ ( z(1−α/2) + z(power) ) * 2*σ_g / √(G)
Where σ_g is the pre‑period SD of the primary metric at the geo‑week level, G is total treated + control geo‑weeks (G_t and G_c per arm). Increase G by adding markets or extending weeks.
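A minimal sketch of the clustered-design sizing in Python (names are my own), using the standard two-sample standard error σ_g·√(1/G_t + 1/G_c) on geo-week averages:

```python
from statistics import NormalDist

def geo_mde(sigma_g: float, g_treat: int, g_control: int,
            alpha: float = 0.05, power: float = 0.80) -> float:
    """MDE for a difference in means on geo-week averages.

    sigma_g: pre-period SD of the metric at the geo-week level.
    g_treat / g_control: number of treated / control geo-weeks.
    """
    za = NormalDist().inv_cdf(1 - alpha / 2)
    zb = NormalDist().inv_cdf(power)
    se = sigma_g * (1 / g_treat + 1 / g_control) ** 0.5  # two-sample SE
    return (za + zb) * se

def design_effect(m: float, icc: float) -> float:
    """DE = 1 + (m - 1) * ICC; divide user-level n by DE for n_eff."""
    return 1 + (m - 1) * icc
```

Note how quickly the design effect bites: 100 users per cluster-period at ICC = 0.02 already triples the variance, so user-level power math badly overstates precision in geo designs.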
Thumb‑rules:
- Aim for ≥ 20 markets per arm or ≥ 8–12 weeks with stable demand.
- If you have few markets, prefer switchback with multiple on/off cycles per market.
4) Assignment & Guardrails
Randomization:
- Holdout: random sample of users/audiences/keywords.
- Geo: pair markets by pre‑period KPI, population, and ad supply (use matching or stratification); randomly assign within pairs.
- Switchback: choose period length longer than attribution lag (e.g., 1–2 weeks); run ≥ 4 cycles.
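The geo pairing step above can be sketched as follows; this is a minimal illustration (sorting on a single pre-period KPI with a seeded coin flip per pair), whereas a real matching would also stratify on population and ad supply:

```python
import random

def pair_and_assign(markets: dict[str, float], seed: int = 42) -> dict[str, str]:
    """Pair markets by pre-period KPI, then randomize within pairs.

    markets: {market_name: pre_period_kpi}. Assumes an even market count.
    """
    rng = random.Random(seed)                  # seeded so assignment is auditable
    ranked = sorted(markets, key=markets.get)  # sort by baseline KPI
    assignment = {}
    for i in range(0, len(ranked) - 1, 2):
        a, b = ranked[i], ranked[i + 1]        # adjacent ranks = closest match
        treated = rng.choice([a, b])           # coin flip within the pair
        assignment[treated] = "test"
        assignment[a if treated == b else b] = "control"
    return assignment
```

Logging the seed and the pair roster is what lets you defend the assignment later if a stakeholder questions why a given market landed in control.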
Guardrails (auto‑stop if breached):
- SRM (sample ratio mismatch) for holdouts.
- Performance: latency & error budgets; brand safety incidents.
- Budget drift: spend ±15% vs plan per arm.
- Seasonality: major promo/holiday detected → extend or pause.
- Spillover: abnormal traffic shifts in adjacent control markets.
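The SRM guardrail is cheap to automate. A minimal sketch (my own function; the 3.841 critical value is chi-square at α = 0.05, df = 1):

```python
def srm_check(n_test: int, n_control: int, expected_ratio: float = 0.5,
              crit: float = 3.841) -> bool:
    """Flag sample ratio mismatch with a one-degree-of-freedom chi-square.

    expected_ratio: planned share of traffic in the test arm.
    Returns True if the observed split is suspicious (guardrail trips).
    """
    total = n_test + n_control
    exp_test = total * expected_ratio
    exp_control = total * (1 - expected_ratio)
    chi2 = ((n_test - exp_test) ** 2 / exp_test
            + (n_control - exp_control) ** 2 / exp_control)
    return chi2 > crit
```

A 53/47 split on 10k users already trips this check; an SRM failure usually means a logging or targeting bug, and the read should be discarded rather than "corrected".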
5) Windows & Measurement
- Exposure window: when the ad/channel can influence users (e.g., ad delivery week).
- Conversion window: time to count outcomes (e.g., +7d, +28d).
- Attribution alignment: measure incremental outcomes, not platform‑reported conversions alone; reconcile with server‑side events.
6) Analysis Plans (pick one and stick to it)
A) Difference‑in‑means on aggregated units
Compute per unit‑period (geo‑week) averages; compare arms with cluster‑robust SEs.
Δ = mean(y_treated) − mean(y_control)
Report: point estimate, 95% CI, p‑value
B) Difference‑in‑Differences (DiD)
Useful when baseline levels differ.
y ~ β0 + β1*Post + β2*Treated + β3*(Post×Treated) + ε
Lift = β3
Cluster SE at geo level; add week fixed effects if needed.
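The β3 interaction term reduces to a difference of two differences, which makes the point estimate easy to sanity-check by hand. A minimal sketch (my own helper names; a full analysis would add cluster-robust SEs at the geo level):

```python
import statistics

def did_lift(treated_pre: float, treated_post: float,
             control_pre: float, control_post: float) -> float:
    """DiD point estimate: (treated change) minus (control change)."""
    return (treated_post - treated_pre) - (control_post - control_pre)

def did_from_rows(rows) -> float:
    """rows: iterable of (treated: bool, post: bool, y: float) observations."""
    cell = lambda t, p: statistics.mean(
        y for tr, po, y in rows if tr == t and po == p)
    return ((cell(True, True) - cell(True, False))
            - (cell(False, True) - cell(False, False)))
```

If treated markets moved from 10 to 15 leads/week while control moved from 10 to 12, the lift is 3, not 5: the control trend is netted out.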
C) Switchback regression
Include market and period fixed effects; treatment toggles within each market.
y_gw ~ μ + α_g (market FE) + τ_w (week FE) + θ*Treat_gw + ε_gw
Lift = θ (cluster SE by market)
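The fixed-effects regression above can be estimated with a plain dummy-variable least squares fit. A minimal sketch, assuming numpy (one market and one week dummy are dropped to avoid collinearity; cluster-robust SEs by market would be the next step):

```python
import numpy as np

def switchback_theta(markets, weeks, treat, y) -> float:
    """Estimate theta from y ~ market FE + week FE + theta * Treat.

    markets/weeks: label sequences; treat: 0/1 toggles; y: outcome
    per market-week. Returns the point estimate on Treat only.
    """
    m_levels = sorted(set(markets))[1:]   # drop first market (baseline)
    w_levels = sorted(set(weeks))[1:]     # drop first week (baseline)
    n = len(y)
    cols = [np.ones(n), np.asarray(treat, dtype=float)]
    for m in m_levels:
        cols.append(np.array([1.0 if x == m else 0.0 for x in markets]))
    for w in w_levels:
        cols.append(np.array([1.0 if x == w else 0.0 for x in weeks]))
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return float(beta[1])                 # coefficient on Treat
```

Because the on/off toggles vary within each market, θ is identified even after absorbing every market's level and every week's seasonality.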
Decision rule: Ship if CI lower bound ≥ 0 (or ≥ business MDE) and guardrails stayed green.
7) Incremental ROAS & Payback
Compute incremental outcomes ΔY and incremental spend ΔSpend.
incROAS = (Revenue_per_outcome * ΔY) / ΔSpend
Payback (months) = incremental CAC / monthly gross profit per customer
Set go/no‑go thresholds before launch (e.g., incROAS ≥ 1.5 or payback ≤ 12 months).
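These three lines of arithmetic are worth encoding so the decision rule is applied mechanically rather than renegotiated after the readout. A minimal sketch (my own function names, using the example thresholds above):

```python
def inc_roas(revenue_per_outcome: float, delta_y: float, delta_spend: float) -> float:
    """Incremental ROAS = incremental revenue / incremental spend."""
    return revenue_per_outcome * delta_y / delta_spend

def payback_months(inc_cac: float, gross_profit_per_month: float) -> float:
    """Months to recoup incremental CAC from monthly gross profit per customer."""
    return inc_cac / gross_profit_per_month

def go_no_go(incroas: float, payback: float,
             min_incroas: float = 1.5, max_payback: float = 12.0) -> bool:
    """Pre-registered decision rule: both thresholds must pass."""
    return incroas >= min_incroas and payback <= max_payback
```

For example, $100 revenue per outcome, 300 incremental outcomes, and $15k incremental spend gives incROAS = 2.0; with a $120 incremental CAC and $15/month gross profit, payback is 8 months, so this campaign ships.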
8) Worked Examples
A) Brand Search Cannibalization (Holdout)
- Split brand keywords; 20% held out for 3 weeks.
- Primary metric: incremental conversions from brand ads.
- Typical result: a share of brand‑ad clicks would have converted via organic or direct anyway; re‑invest the savings into non‑brand or creative.
B) Meta Upper‑Funnel (Geo‑split)
- 24 DMAs matched on revenue/seasonality; 12 test receive prospecting; 12 control pause.
- Primary metric: qualified leads/week; analysis: DiD with week fixed effects.
- Outcome: incremental lead cost and mROAS; adjust budgets 10–20%.
C) CTV Flight (Switchback)
- One country; 6 on/off weeks alternating.
- Primary metric: branded search volume & direct signups; analysis: FE regression.
- Outcome: detect short‑term lift; plan national rollout with 10% holdout.
9) Common Failure Modes (and fixes)
- Peeking/early stopping → inflated false positives. Fix: commit duration or use alpha‑spending.
- Market spillover (test ads seen in control). Fix: use larger buffers, exclude border zips, monitor adjacency metrics.
- Power too low → noisy zero. Fix: raise MDE, extend weeks, pool outcomes.
- Platform modeling drift vs server logs. Fix: treat platform as one view; ground truth = server‑side outcomes.
- Holiday/promos distort trends. Fix: avoid or include FE and run longer.
- Unit mismatch (assign at geo, measure at user without clustering). Fix: analyze on the assignment unit or cluster SEs.
10) Implementation Checklist (print this)
- Pre‑registration one‑pager signed off (hypothesis, design, α/power/MDE).
- Assignment executed and logged (pairs, seeds, arm rosters).
- Power check passed for the chosen MDE and duration.
- Guardrail monitors live (SRM, spend drift, spillover, seasonality).
- Windows locked (exposure, conversion lag, blackouts).
- Analysis plan frozen before first readout; no peeking.
- Rollout plan ready, including the post‑win holdout.
11) Google Sheets Snippets
Per‑arm n for user holdout (proportions):
=LET(p0,B2, rel,B3, alpha,B4, pow,B5,
p1,p0*(1+rel), pbar,(p0+p1)/2,
za, NORM.S.INV(1-alpha/2), zb, NORM.S.INV(pow),
num,(za*SQRT(2*pbar*(1-pbar)) + zb*SQRT(p0*(1-p0)+p1*(1-p1)))^2,
den,(p1-p0)^2, num/den)
MDE given n:
=LET(p0,B2, alpha,B4, pow,B5, n,B6,
(NORM.S.INV(1-alpha/2)+NORM.S.INV(pow))*SQRT(2*p0*(1-p0)/n))
Design effect (clustered), where B2 = m users per cluster‑period and B3 = ICC:
=1 + (B2-1)*B3
12) Deliverables (what stakeholders see)
- One‑pager (hypothesis, design, power, windows, guardrails).
- QA dashboard (SRM, spend split, exposure counts, invariants).
- Results deck (point estimate + 95% CI, DiD table, incROAS, decision).
- Rollout brief (spend reallocation, holdout plan, monitoring).
13) 30‑60‑90 Day Program Plan
Days 1–30: Build experiment brief template; choose 2 priority surfaces (brand search holdout + Meta geo); set power calculators; create QA dashboard.
Days 31–60: Run 1–2 studies; publish results with CIs and decisions; start switchback design for CTV.
Days 61–90: Institutionalize cadence (monthly readout); tie MMM priors to lift results; expand to 2 new markets or another channel.