Thesis: Dashboards don’t fail because they’re ugly—they fail because the data contract is weak. Data quality debt accrues interest as broken trust, bad decisions, and rework. The cure isn’t “more charts,” it’s a Decision‑First Data OS: contracts → tests → observability → runbooks → governance tied to the business.
1) Definitions
- Data Quality Debt: The compounding cost of missing contracts, ambiguous metrics, and untested pipelines. It grows every time a new dataset is shipped without owners, tests, or SLAs.
- Dashboard Theater: Attractive but untrusted charts that optimize for presentation, not decision‑making. Symptoms: weekly copy‑paste rituals, conflicting numbers across teams, “can you export this to CSV?” requests.
2) The Decision‑First Data OS (overview)
- Decision & Metric Contracts — Start from the decision: what will change if the number moves? Create a metric one‑pager (definition, query, owner, thresholds, caveats).
- Data Contracts — Formal schemas at source boundaries; typed fields; PII policy; breaking‑change rules.
- Quality Tests — At ingest, transform, and publish layers; promote only if tests pass.
- Observability & SLAs — Freshness, volume, nulls, distribution drift, lineage, error budgets.
- Runbooks — Incident severities, rollback/backfill steps, communication templates.
- Governance — RACI, change logs, dashboard lifecycle (birth → usage → retirement).
3) Quality Dimensions → SLOs (set targets you can defend)
| Dimension | What it means | Example SLO | Guardrail |
|---|---|---|---|
| Freshness | Data arrives on time | orders ≤ 20 min delay p95 | Freeze downstream if > 1h |
| Completeness | Expected rows/fields present | Day‑1 coverage ≥ 99.5% | Alert at –1σ vs 4‑week mean |
| Validity | Values match type/range | price_cents ≥ 0, ISO dates | Drop/flag invalid rows |
| Uniqueness | No duplicates where keys should be unique | order_id unique/day | Hard fail build |
| Consistency | Same business rule across tables | revenue = qty×price | Diff check < 0.5% |
| Accuracy | Matches ground truth/source of record | Payment totals within ±0.2% | Reconcile daily |
| Lineage | Trace from metric → sources | Auto‑updated graph | Required for sign‑off |
Error budget: percent of time an SLO can be breached before a change freeze (e.g., 1% monthly for critical tables).
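As a quick sketch of the arithmetic (assuming a 30‑day month and the 1% monthly budget from the example):

```python
def error_budget_minutes(budget_pct: float, days: int = 30) -> float:
    """Minutes per month an SLO may be breached before a change freeze."""
    return days * 24 * 60 * (budget_pct / 100.0)

def budget_burn(breach_minutes: float, budget_pct: float, days: int = 30) -> float:
    """Fraction of the monthly error budget consumed (1.0 = freeze)."""
    return breach_minutes / error_budget_minutes(budget_pct, days)

# A 1% budget on a critical table allows 432 minutes of breach per 30-day month.
print(error_budget_minutes(1.0))   # 432.0
print(round(budget_burn(300, 1.0), 2))  # 0.69: two thirds of the budget burned
```

Tracking burn as a fraction makes the freeze trigger unambiguous: freeze when burn ≥ 1.0.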
4) Test Matrix (attach tests where they matter)
Ingest (raw → staged)
- Schema checksum, column count, types, required fields non‑null.
- Volume vs 4‑week rolling mean; outlier guardrails.
- PII scan against allowlist.
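The volume guardrail above can be sketched in a few lines of Python; the window length and ±3σ threshold are illustrative, and a real pipeline would feed in the 4‑week (28‑day) daily counts:

```python
import statistics

def volume_guardrail(daily_counts: list[int], today: int, sigmas: float = 3.0) -> bool:
    """Flag today's row count if it falls outside mean ± sigmas * stdev
    of the trailing window (e.g., 28 days for a 4-week baseline)."""
    mean = statistics.mean(daily_counts)
    stdev = statistics.stdev(daily_counts)
    return abs(today - mean) > sigmas * stdev

history = [10_000, 10_250, 9_900, 10_100, 10_050, 9_950, 10_200]
print(volume_guardrail(history, 10_100))  # False: within band
print(volume_guardrail(history, 6_000))   # True: alert
```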
Transform (staged → modeled)
- Primary/foreign‑key constraints (surrogate keys OK).
- Valid ranges & enums; referential integrity.
- Business rules: `gross = net + tax + shipping` (within tolerance).
- Slowly changing dims: no retroactive key churn.
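A business‑rule check like the gross reconciliation is a short predicate over modeled rows; field names and the one‑cent tolerance here are illustrative:

```python
def check_gross(rows: list[dict], tol_cents: int = 1) -> list[dict]:
    """Return rows violating gross = net + tax + shipping.
    Integer cents; the tolerance absorbs rounding."""
    return [r for r in rows
            if abs(r["gross"] - (r["net"] + r["tax"] + r["shipping"])) > tol_cents]

rows = [
    {"order_id": "a1", "gross": 1210, "net": 1000, "tax": 110, "shipping": 100},
    {"order_id": "a2", "gross": 1300, "net": 1000, "tax": 110, "shipping": 100},  # off by 90
]
print([r["order_id"] for r in check_gross(rows)])  # ['a2']
```

Emitting the violating rows (not just a pass/fail flag) makes the failure actionable in the runbook.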
Publish (modeled → marts/BI)
- Metric recon: day‑over‑day diffs within tolerance.
- Freshness SLO checks; row‑level sample audits.
- Experiment invariants (SRM) where applicable.
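An SRM (sample ratio mismatch) invariant reduces to a one‑degree‑of‑freedom chi‑square test; this sketch assumes a planned 50/50 split and uses the standard p = 0.05 critical value:

```python
def srm_check(observed_a: int, observed_b: int, expected_ratio: float = 0.5,
              critical: float = 3.841) -> bool:
    """Chi-square test for sample ratio mismatch (df = 1).
    Returns True when the observed split deviates significantly from
    the planned ratio; 3.841 is the p = 0.05 critical value."""
    n = observed_a + observed_b
    exp_a = n * expected_ratio
    exp_b = n * (1 - expected_ratio)
    chi2 = (observed_a - exp_a) ** 2 / exp_a + (observed_b - exp_b) ** 2 / exp_b
    return chi2 > critical

print(srm_check(5_000, 5_050))  # False: within noise
print(srm_check(5_000, 5_600))  # True: likely assignment bug
```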
Test naming: `<table>::<layer>::<dimension>` → `orders::transform::uniqueness_order_id`
5) Metric One‑Pager (template)
- Name: Monthly Active Accounts (MAA)
- Definition: Distinct `account_id` with ≥1 `core_action` in calendar month
- SQL: link to versioned query
- Owner: Analytics → Jane D.
- Consumers: Exec weekly, Growth monthly
- Guardrails: event latency p95 < 300ms; SRM alarms in experiments
- Caveats: excludes sandbox/test tenants; backfills marked
- Change log: YYYY‑MM‑DD reason, PR link
6) Data Contract (YAML stub)
```yaml
name: orders
owner: data-platform
schema:
  order_id: {type: string, constraints: [primary_key]}
  account_id: {type: string, constraints: [not_null]}
  price_cents: {type: integer, constraints: [min: 0]}
  currency: {type: string, constraints: [enum: [USD, EUR, CAD]]}
  created_at: {type: timestamp, constraints: [not_null]}
slas:
  freshness_p95_minutes: 20
  completeness_min_pct: 99.5
breaking_change_protocol:
  policy: require-new-column
  deprecate_after: 30d
pii_policy: forbid
```
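A minimal enforcement sketch in Python, covering a subset of the contract's fields (a real implementation would load the YAML itself and run per batch):

```python
# Illustrative subset of the orders contract; constraint names mirror the YAML.
CONTRACT = {
    "order_id":    {"type": str, "not_null": True},
    "account_id":  {"type": str, "not_null": True},
    "price_cents": {"type": int, "min": 0},
    "currency":    {"type": str, "enum": {"USD", "EUR", "CAD"}},
}

def violations(record: dict) -> list[str]:
    """Return a human-readable list of contract violations for one record."""
    errs = []
    for field, rules in CONTRACT.items():
        val = record.get(field)
        if val is None:
            if rules.get("not_null"):
                errs.append(f"{field}: null")
            continue
        if not isinstance(val, rules["type"]):
            errs.append(f"{field}: wrong type")
        if "min" in rules and val < rules["min"]:
            errs.append(f"{field}: below min")
        if "enum" in rules and val not in rules["enum"]:
            errs.append(f"{field}: not in enum")
    return errs

print(violations({"order_id": "o1", "account_id": "a1",
                  "price_cents": -5, "currency": "GBP"}))
# ['price_cents: below min', 'currency: not in enum']
```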
7) Observability Signals (what to alert on)
- Freshness lag vs SLO
- Row count anomaly (±3σ or robust z‑score)
- Null ratio drift per critical column
- Dimension cardinality spikes (e.g., unexpected `status` values)
- Join key coverage (`user_id`, `account_id`, `session_id`)
- Metric diffs vs prior day/week beyond tolerance
- Lineage break (upstream failure)
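For the robust z‑score mentioned above, a median/MAD version resists the very outliers it is meant to catch; the null‑ratio history here is illustrative:

```python
import statistics

def robust_z(history: list[float], today: float) -> float:
    """Median/MAD z-score: resistant to outliers that distort a plain
    mean/stdev baseline. 0.6745 rescales MAD to ~1 stdev under normality."""
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history)
    if mad == 0:
        return 0.0  # flat history: fall back to an exact-match rule instead
    return 0.6745 * (today - med) / mad

null_ratio_history = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.013]
print(robust_z(null_ratio_history, 0.011))  # 0.0: normal
print(robust_z(null_ratio_history, 0.080) > 10)  # True: page someone
```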
Alert routing: Sev‑1 (payments, revenue) → Pager on‑call; Sev‑2 (product analytics) → Slack + ticket; Sev‑3 (ad‑hoc) → daily digest.
8) Runbooks (when—not if—things break)
Sev‑1 checklist
- Announce incident with scope & affected metrics.
- Stop the bleeding: halt downstream builds; set banner on BI with timestamp.
- Identify regression PR; roll back or hotfix; re‑run backfill with checkpoints.
- Reconcile vs source of record; attach evidence.
- Close with RCA doc and prevention action (test added, contract updated).
Communication template
Status: Data incident Sev‑1. Impact: revenue dashboards stale since 09:40 UTC. ETA: 30–45m. Workaround: export from billing app if urgent. Next update: 10:15 UTC.
9) Dashboard Lifecycle (end the theater)
- Birth: requires metric one‑pager + owner + SLOs.
- Usage SLO: viewed by ≥2 teams or ≥N users/month.
- Review: quarterly redundancy check; consolidate overlapping boards.
- Retirement: archive if usage below SLO for 2 cycles or metric deprecated.
- Badge trust: green (all SLOs), yellow (minor breach), red (do not use).
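Badge assignment can be mechanical; the SLO names and the choice of which SLOs count as critical are illustrative here, only the green/yellow/red semantics come from the text:

```python
def trust_badge(slo_results: dict[str, bool], critical: set[str]) -> str:
    """green: all SLOs pass; red: a critical SLO fails; yellow: minor breach."""
    failing = {name for name, ok in slo_results.items() if not ok}
    if not failing:
        return "green"
    if failing & critical:
        return "red"
    return "yellow"

results = {"freshness": True, "completeness": False, "uniqueness": True}
print(trust_badge(results, critical={"freshness", "uniqueness"}))  # yellow
print(trust_badge(results, critical={"completeness"}))             # red
```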
10) Program KPIs (measure the Data OS)
- Trust score: share of dashboards green
- MTTD/MTTR: mean time to detect/resolve incidents
- Test coverage: % critical tables with ≥1 test per dimension
- SLO adherence: error budget burn/month
- Dashboard count: net reduction of redundant boards
- Decision velocity: time from question → approved metric → decision
11) 30‑60‑90 Day Plan
Days 1–30 — Inventory critical tables/metrics; write metric one‑pagers; add basic tests (schema, nulls, uniqueness); set SLOs; add freshness/volume alerts.
Days 31–60 — Implement data contracts on top 5 sources; expand tests to validity/consistency; wire lineage; publish runbooks; badge dashboards.
Days 61–90 — Error budgets + change freeze policy; quarterly dashboard review; add distribution‑drift detection; tie availability/latency SLOs to BI performance.
12) SQL & dbt Snippets
Freshness check (BigQuery)
```sql
SELECT
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(created_at), MINUTE) AS minutes_since_last
FROM raw.orders;
```
dbt tests (schema.yml fragment)
```yaml
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
      - name: account_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_accounts')
              field: account_id
      - name: currency
        tests:
          - accepted_values:
              values: ['USD', 'EUR', 'CAD']
```
13) Anti‑Theater Checklist (print this)
- Metric has one‑pager, owner, versioned SQL
- Table has contract, tests, SLOs
- Observability alerts wired (freshness, volume, nulls, diffs)
- Dashboard shows data freshness badge
- Runbook link visible on dashboard
- Retirement criteria scheduled
14) ROI Model (quick math)
Let I be incidents/month, H hours/incident, C blended cost/hour, R revenue at risk/day, p probability a Sev‑1 misguides a decision.
Monthly cost of data debt ≈ I × H × C + p × R.
Even halving I or p via SLOs/tests often pays for the Data OS in < 1 quarter.
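Plugging illustrative numbers into the formula (all figures hypothetical, units as defined above):

```python
def monthly_debt_cost(incidents: float, hours_per_incident: float,
                      cost_per_hour: float, p_misguided: float,
                      revenue_at_risk: float) -> float:
    """I x H x C + p x R, per the ROI model above."""
    return incidents * hours_per_incident * cost_per_hour + p_misguided * revenue_at_risk

before = monthly_debt_cost(8, 12, 150, 0.10, 500_000)  # 14_400 + 50_000 = 64_400
after  = monthly_debt_cost(4, 12, 150, 0.05, 500_000)  # 7_200 + 25_000 = 32_200
print(before - after)  # 32_200 per month recovered by halving I and p
```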
15) Comms Kit
Exec one‑liner:
“We don’t ship dashboards without contracts, tests, and owners. If the trust badge is red, we don’t use it to decide.”
LinkedIn post (short):
Dashboard theater dies when your data has a contract. Define the decision, write the metric one‑pager, enforce SLOs (freshness, completeness, validity), and add runbooks. Trust goes up, rework goes down.
16) SEO Kit
- Title (≤60): Data Quality Debt: Avoiding Dashboard Theater
- Meta (≤160): A decision‑first playbook to kill dashboard theater—data contracts, quality tests, observability, runbooks, and governance linked to business outcomes.
- Slug: /data-quality-debt-dashboard-theater
- Keywords: data quality debt, dashboard theater, data contracts, data quality tests, freshness completeness validity, data observability, lineage, data SLA, error budgets, dbt tests
17) Image Briefs
- Cover: Editorial 3D—two scenes split diagonally: left shows glossy dashboards with a red trust badge; right shows a contract, checklist, and green shield over a lineage graph. Cool blue palette, subtle grid.
- Diagram: Data OS flow: Decisions → Metric Contracts → Data Contracts → Tests → Observability → Runbooks → Governance.
Bottom line: Treat dashboards as outputs of a robust data system, not the system itself. Contracts, tests, and SLAs turn data from theater into decisions you can defend.