
How the Assurance dashboard was built

The research and rationale behind /assurance, a decomposed-reliability dashboard where every number comes from a real run.

The Assurance dashboard began as a single trust score and a drift table. It was rebuilt into a decomposed-reliability dashboard — the kind a risk committee and an ML engineer can both read — without inventing a single number. This article records the research and the reasoning behind it.

The research

A four-lens specialist review was run, with each lens grounded in the real data the pipeline already produces (data/demo/runs/assurance.json, the eval history, the golden suite) and in the Quiet Workspace design system:

  • an ML-eval / calibration data scientist — what decomposed-reliability panel an NLI-ensemble verifier needs (operating point, confusion, per-mutation lift, calibration / ECE) and how to render a near-perfect result honestly;
  • a dashboard / information-design lens — the two-audience information architecture (one board number plus an engineer's drill-down) expressed in quiet.css components;
  • a trust & audit lens — what makes the number defensible to a regulator: the baseline contrast, the trap taxonomy, sample sizes, the structural guarantees;
  • an SVG dataviz lens — concrete, dependency-free inline-SVG charts that honour the design tokens.

Their recommendations were synthesised into one build spec in which every chart and KPI maps to a real field in a real artifact. Two further adversarial review passes, covering honesty, data science, design and frontend, then verified the build and triaged the fixes.

The honesty discipline

The product's whole thesis is “can't say what it can't prove,” so the dashboard is held to the same bar (CONSTITUTION Articles I & VIII): every metric is computed from a real, seeded run — never hand-typed. Confidence intervals are computed (Wilson) in app/charts.py, not written into the HTML. A saturated rate is never shown naked — it always carries its denominator and the single-judge contrast.
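As a rough illustration of the Wilson interval mentioned above, the sketch below computes a 95% score interval for a catch rate. It is not the code in app/charts.py; the function name `wilson_interval` and the default `z = 1.96` are assumptions made for this example.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; z = 1.96 assumes a 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Counts quoted in this article: Gate 37/37, single judge 9/37.
gate_ci = wilson_interval(37, 37)
judge_ci = wilson_interval(9, 37)
```

Unlike the naive normal approximation, the Wilson interval does not collapse to zero width at a saturated rate: for 37/37 the lower bound is 37 / (37 + z²), roughly 0.906, which is why the denominator matters when a "100%" is displayed.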

Because the Gate is perfect on the held-out trap set (37/37 caught, 0 false-rejects, ECE 0), a lone “100%” would read as fake. So the page leads with the contrast — the Gate catches 37/37 where a number-blind single judge catches 9/37 — and renders the calibration of a decisive verifier honestly (only 2 of 5 reliability bins populate; no curve is drawn through the empty middle).
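The calibration point can be made concrete with a small sketch of expected calibration error (ECE) over 5 equal-width bins. The scores below are synthetic and assumed purely for illustration (a verifier that outputs exactly 0.0 or 1.0); the function name `reliability_bins` and the input shape are hypothetical, not the pipeline's own.

```python
def reliability_bins(scores, outcomes, n_bins=5):
    """Bin predicted scores into equal-width bins and compute ECE.

    scores   -- predicted probability that an item is a trap, in [0, 1]
    outcomes -- 1 if the item really is a trap, else 0
    Returns (bins, ece); an empty bin is reported as None, not as a point.
    """
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("scores and outcomes must be non-empty and equal length")

    totals = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, score sum, outcome sum
    for score, outcome in zip(scores, outcomes):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
        idx = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        totals[idx][0] += 1
        totals[idx][1] += score
        totals[idx][2] += outcome

    bins, ece = [], 0.0
    for count, score_sum, outcome_sum in totals:
        if count == 0:
            bins.append(None)  # nothing to plot, nothing to interpolate
            continue
        mean_score = score_sum / count
        observed = outcome_sum / count
        bins.append((count, mean_score, observed))
        ece += (count / len(scores)) * abs(observed - mean_score)
    return bins, ece


# Synthetic, assumed scores: 37 traps scored 1.0 and 10 clean items scored 0.0.
scores = [1.0] * 37 + [0.0] * 10
outcomes = [1] * 37 + [0] * 10
bins, ece = reliability_bins(scores, outcomes)
populated = sum(b is not None for b in bins)
```

Under these assumed scores only the first and last bins are populated and the ECE is 0. Reporting empty bins as `None` keeps the chart from implying calibration evidence in the middle of the range where no predictions exist.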

The chart catalogue

Every chart ← a real field

Operating point: catch-rate vs false-reject, Gate vs single judge; from gate/baseline in assurance.json, with Wilson whiskers
Confusion pair: raw counts {37,0,0,10} vs {9,28,0,10}, derived from the 47 per-trap items[]
Lift bars: catch-rate by mutation family (number-drift, superlative, false-equivalence, unsayable) with real per-type denominators
Reliability + ECE: the 5 calibration bins + ECE, rendered for a decisive verifier
Trap heat-grid: per-claim × attack, showing where the single judge misses
Stability strip: 7 seeded runs over time, flat by construction (decision R5)
Bandit + KPI tree: the real ThompsonBandit on synthetic traffic, fenced and labelled illustrative
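For readers unfamiliar with the technique behind the bandit panel, here is a minimal Beta-Bernoulli Thompson sampling sketch run on synthetic traffic. It is not the product's ThompsonBandit: the class name `BetaBernoulliBandit`, the arm names and the reward rates are all made up for illustration.

```python
import random


class BetaBernoulliBandit:
    """Hypothetical Thompson sampler with a Beta(1, 1) prior per arm."""

    def __init__(self, arms, seed=0):
        self._rng = random.Random(seed)
        self.alpha = {arm: 1.0 for arm in arms}  # 1 + observed successes
        self.beta = {arm: 1.0 for arm in arms}   # 1 + observed failures

    def select(self):
        # Draw one plausible success rate per arm and play the best draw.
        draws = {
            arm: self._rng.betavariate(self.alpha[arm], self.beta[arm])
            for arm in self.alpha
        }
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        if reward:
            self.alpha[arm] += 1.0
        else:
            self.beta[arm] += 1.0


def simulate(true_rates, rounds, seed=0):
    """Run the bandit against synthetic Bernoulli traffic; return pulls per arm."""
    traffic = random.Random(seed + 1)  # separate stream for the synthetic rewards
    bandit = BetaBernoulliBandit(list(true_rates), seed=seed)
    pulls = {arm: 0 for arm in true_rates}
    for _ in range(rounds):
        arm = bandit.select()
        reward = traffic.random() < true_rates[arm]
        bandit.update(arm, reward)
        pulls[arm] += 1
    return pulls


# Made-up arms and rates, chosen only to show the allocation shifting.
pulls = simulate({"variant_a": 0.1, "variant_b": 0.9}, rounds=500)
```

Because the traffic is synthetic, the resulting pull counts say nothing about real users, which is the reason such a panel is labelled illustrative.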

How it's verified — and kept aligned

  • The eval harness is immutable (CONSTITUTION Article VII) — all trust properties stay green in tests/; the dashboard adapts to the tests, never the reverse.
  • Every derived count (confusion, per-type fractions, CIs) is recomputed from items[] at request time, so the page cannot show a count that has drifted from the underlying run.
  • Shared dashboard primitives (q-sec, q-sim, q-kpitree) live in quiet.css so /assurance and the KPI control tower render the same components.
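Recomputing derived counts from items[] might look like the sketch below. The item shape is an assumption: the keys `family`, `is_trap`, `gate_rejected` and `judge_rejected` are hypothetical, and the six sample items are invented for illustration rather than taken from assurance.json.

```python
from collections import Counter, defaultdict

# Invented sample items; the real artifact's schema may differ.
sample_items = [
    {"family": "number-drift", "is_trap": True, "gate_rejected": True, "judge_rejected": False},
    {"family": "number-drift", "is_trap": True, "gate_rejected": True, "judge_rejected": True},
    {"family": "superlative", "is_trap": True, "gate_rejected": True, "judge_rejected": False},
    {"family": "unsayable", "is_trap": True, "gate_rejected": True, "judge_rejected": False},
    {"family": "clean", "is_trap": False, "gate_rejected": False, "judge_rejected": False},
    {"family": "clean", "is_trap": False, "gate_rejected": False, "judge_rejected": False},
]


def confusion(items, verdict_key):
    """Count caught / missed traps and false rejects / passes for one verifier."""
    counts = Counter()
    for item in items:
        rejected = item[verdict_key]
        if item["is_trap"]:
            counts["caught" if rejected else "missed"] += 1
        else:
            counts["false_reject" if rejected else "passed"] += 1
    return {key: counts[key] for key in ("caught", "missed", "false_reject", "passed")}


def catch_rate_by_family(items, verdict_key):
    """Return {family: (caught, total)} so every rate keeps its denominator."""
    caught, total = defaultdict(int), defaultdict(int)
    for item in items:
        if not item["is_trap"]:
            continue
        total[item["family"]] += 1
        caught[item["family"]] += int(item[verdict_key])
    return {family: (caught[family], total[family]) for family in total}


gate_confusion = confusion(sample_items, "gate_rejected")
judge_confusion = confusion(sample_items, "judge_rejected")
judge_by_family = catch_rate_by_family(sample_items, "judge_rejected")
```

Returning `(caught, total)` pairs rather than precomputed percentages means the rendering layer always has the denominator to hand, and nothing derived is stored where it could go out of date.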

See it live: the Assurance dashboard. For the engine it audits, see The Assurance Lab.
