How the Assurance dashboard was built
The research and rationale behind /assurance: a decomposed-reliability dashboard where every number comes from a real run.
The Assurance dashboard began as a single trust score and a drift table. It was rebuilt into a decomposed-reliability dashboard — the kind a risk committee and an ML engineer can both read — without inventing a single number. This article records the research and the reasoning behind it.
The research
A four-lens specialist review was run, each grounded in the real data the pipeline already produces (data/demo/runs/assurance.json, the eval history, the golden suite) and in the Quiet Workspace design system:
- an ML-eval / calibration data scientist — what decomposed-reliability panel an NLI-ensemble verifier needs (operating point, confusion, per-mutation lift, calibration / ECE) and how to render a near-perfect result honestly;
- a dashboard / information-design lens — the two-audience information architecture (one board number plus an engineer's drill-down) expressed in quiet.css components;
- a trust & audit lens — what makes the number defensible to a regulator: the baseline contrast, the trap taxonomy, sample sizes, the structural guarantees;
- an SVG dataviz lens — concrete, dependency-free inline-SVG charts that honour the design tokens.
Their recommendations were synthesised into one build spec in which every chart and KPI maps to a real field in a real artifact. Two further adversarial review passes, covering honesty, data science, design, and frontend, then verified the build and triaged the fixes.
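As a rough illustration of what a dependency-free inline-SVG chart can look like, the sketch below renders one rate bar with an interval whisker as a plain string. The function name `rate_bar_svg` and the CSS custom properties `--q-track`, `--q-accent`, and `--q-ink` are hypothetical stand-ins, not the names used in app/charts.py or quiet.css.

```python
from xml.sax.saxutils import quoteattr


def rate_bar_svg(label, rate, lo, hi, width=200, height=28):
    """Return an inline SVG string: a horizontal rate bar plus a CI whisker.

    rate, lo and hi are proportions in [0, 1]. The token names below
    (--q-track, --q-accent, --q-ink) are assumed for illustration only.
    """
    for value in (rate, lo, hi):
        if not 0.0 <= value <= 1.0:
            raise ValueError("rate, lo and hi must lie in [0, 1]")
    mid = height / 2
    # Colours go in style="" because CSS custom properties resolve there,
    # not in SVG presentation attributes such as fill="...".
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'viewBox="0 0 {width} {height}" role="img" '
        f'aria-label={quoteattr(label)}>'
        # Full-width track so the denominator scale is always visible.
        f'<rect x="0" y="{mid - 4:.1f}" width="{width}" height="8" '
        f'style="fill:var(--q-track, #ddd)"/>'
        # Filled portion proportional to the observed rate.
        f'<rect x="0" y="{mid - 4:.1f}" width="{rate * width:.1f}" height="8" '
        f'style="fill:var(--q-accent, #446)"/>'
        # Whisker spanning the interval bounds.
        f'<line x1="{lo * width:.1f}" y1="{mid:.1f}" '
        f'x2="{hi * width:.1f}" y2="{mid:.1f}" '
        f'style="stroke:var(--q-ink, #111);stroke-width:2"/>'
        f'</svg>'
    )
```

Because the output is an ordinary string, it can be dropped straight into a server-rendered template with no charting library on the page.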
The honesty discipline
The product's whole thesis is “can't say what it can't prove,” so the dashboard is held to the same bar (CONSTITUTION Articles I & VIII): every metric is computed from a real, seeded run — never hand-typed. Confidence intervals are computed (Wilson) in app/charts.py, not written into the HTML. A saturated rate is never shown naked — it always carries its denominator and the single-judge contrast.
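For readers unfamiliar with the Wilson score interval, here is a minimal sketch of the standard formula. It is not the code in app/charts.py; the function name `wilson_interval` and the default `z = 1.96` (roughly a 95% interval) are assumptions for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Returns (lower, upper), clamped to [0, 1]. z=1.96 is an assumed
    default corresponding to roughly 95% coverage.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

The interval matters most for a saturated rate. With 37 successes out of 37, the lower bound is n / (n + z²), roughly 0.906, which is why a 100% figure should always be shown with its denominator and whisker.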
The chart catalogue
| Chart | What it shows |
| --- | --- |
| Operating point | catch-rate vs false-reject, Gate vs single judge, from gate/baseline in assurance.json, with Wilson whiskers |
| Confusion pair | raw counts {37,0,0,10} vs {9,28,0,10}, derived from the 47 per-trap items[] |
| Lift bars | catch-rate by mutation family (number-drift, superlative, false-equivalence, unsayable) with real per-type denominators |
| Reliability + ECE | the 5 calibration bins + ECE, rendered for a decisive verifier |
| Trap heat-grid | per-claim × attack, showing where the single judge misses |
| Stability strip | 7 seeded runs over time, flat by construction (decision R5) |
| Bandit + KPI tree | the real ThompsonBandit on synthetic traffic, fenced and labelled illustrative |
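The Reliability + ECE row relies on expected calibration error over a fixed number of bins. The sketch below shows the usual equal-width binning definition with five bins. It is a generic illustration rather than the pipeline's implementation, and the function name and input shape (parallel lists of confidences and 0/1 outcomes) are assumptions.

```python
def expected_calibration_error(confidences, outcomes, n_bins=5):
    """ECE with equal-width bins over [0, 1].

    confidences: predicted probabilities in [0, 1].
    outcomes:    matching 0/1 labels (1 = the prediction was correct).
    Each non-empty bin contributes |accuracy - mean confidence|,
    weighted by the share of samples that fall in it.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have equal length")
    total = len(confidences)
    if total == 0:
        raise ValueError("at least one sample is required")

    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, y in zip(confidences, outcomes):
        # A confidence of exactly 1.0 belongs in the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        conf_sum[b] += c
        hit_sum[b] += y
        count[b] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins carry no weight
        gap = abs(hit_sum[b] / count[b] - conf_sum[b] / count[b])
        ece += (count[b] / total) * gap
    return ece
```

A decisive verifier puts almost all of its mass in the first and last bins, so most of the five bins can be empty. Showing per-bin counts next to the reliability curve keeps that visible instead of implying a smooth curve.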
How it's verified — and kept aligned
- The eval harness is immutable (CONSTITUTION Article VII): all trust properties stay green in tests/; the dashboard adapts to the tests, never the reverse.
- Every derived count (confusion, per-type fractions, CIs) is recomputed from items[] at request time, so a stale number is impossible.
- Shared dashboard primitives (q-sec, q-sim, q-kpitree) live in quiet.css so /assurance and the KPI control tower render the same components.
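Recomputing counts from items[] on every request can be sketched as a single pass over the per-trap records. The record shape used here (`is_trap`, `gate_rejected`, `judge_rejected`) is hypothetical, since the actual schema of assurance.json is not shown in this article, and the toy data is invented purely to exercise the function.

```python
from collections import Counter


def confusion_counts(items, verdict_key):
    """Tally a 2x2 confusion table from per-item records.

    items:       iterable of dicts with a boolean "is_trap" field and a
                 boolean verdict field (hypothetical schema).
    verdict_key: which verdict to score, e.g. "gate_rejected".
    """
    counts = Counter()
    for item in items:
        is_trap = bool(item["is_trap"])
        rejected = bool(item[verdict_key])
        if is_trap and rejected:
            counts["caught"] += 1
        elif is_trap:
            counts["missed"] += 1
        elif rejected:
            counts["false_reject"] += 1
        else:
            counts["passed"] += 1
    # Fixed key order so the table always renders the same way.
    return {k: counts[k] for k in ("caught", "missed", "false_reject", "passed")}


# Toy data for illustration only; not taken from a real run.
toy_items = [
    {"is_trap": True, "gate_rejected": True, "judge_rejected": True},
    {"is_trap": True, "gate_rejected": True, "judge_rejected": False},
    {"is_trap": True, "gate_rejected": True, "judge_rejected": False},
    {"is_trap": False, "gate_rejected": False, "judge_rejected": False},
]
gate_table = confusion_counts(toy_items, "gate_rejected")
judge_table = confusion_counts(toy_items, "judge_rejected")
```

Because nothing is cached between the artifact and the page, the denominators shown beside each rate are always the ones the rate was computed from.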
See it live: the Assurance dashboard. For the engine it audits, see The Assurance Lab.