
How the Assurance dashboard was built

The research and rationale behind /assurance, a decomposed-reliability dashboard where every number comes from a real run.

The Assurance dashboard began as a single trust score and a drift table. It was rebuilt into a decomposed-reliability dashboard — the kind a risk committee and an ML engineer can both read — without inventing a single number. This article records the research and the reasoning behind it.

The research

A four-lens specialist review was run, each grounded in the real data the pipeline already produces (data/demo/runs/assurance.json, the eval history, the golden suite) and in the Quiet Workspace design system:

an ML-eval / calibration data scientist — what decomposed-reliability panel an NLI-ensemble verifier needs (operating point, confusion, per-mutation lift, calibration / ECE) and how to render a near-perfect result honestly;
a dashboard / information-design lens — the two-audience information architecture (one board number plus an engineer's drill-down) expressed in quiet.css components;
a trust & audit lens — what makes the number defensible to a regulator: the baseline contrast, the trap taxonomy, sample sizes, the structural guarantees;
an SVG dataviz lens — concrete, dependency-free inline-SVG charts that honour the design tokens.

Their recommendations were synthesised into one build spec in which every chart and KPI maps to a real field in a real artifact. Two further adversarial review passes, covering honesty, data science, design, and frontend, then verified the build and triaged the fixes.

The honesty discipline

The product's whole thesis is “can't say what it can't prove,” so the dashboard is held to the same bar (CONSTITUTION Articles I & VIII): every metric is computed from a real, seeded run — never hand-typed. Confidence intervals are computed (Wilson) in app/charts.py, not written into the HTML. A saturated rate is never shown naked — it always carries its denominator and the single-judge contrast.
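As a rough illustration of the Wilson interval mentioned above, here is a minimal sketch. The function name `wilson_interval` and the 95% level (z = 1.96) are assumptions made for this example; the actual code in app/charts.py is not shown in this article and may differ.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; z = 1.96 (about 95%) is an assumption.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] so floating-point error cannot push a bound outside the unit interval.
    return max(0.0, centre - half), min(1.0, centre + half)


# The two headline rates quoted in this article, each with its denominator.
gate_lo, gate_hi = wilson_interval(37, 37)
judge_lo, judge_hi = wilson_interval(9, 37)
print(f"Gate  37/37: [{gate_lo:.3f}, {gate_hi:.3f}]")
print(f"Judge  9/37: [{judge_lo:.3f}, {judge_hi:.3f}]")
```

Under these assumptions the lower bound for 37/37 is roughly 0.906, not 1.0. That is why a saturated rate is shown with its denominator: the interval makes the sample size visible.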

Because the Gate is perfect on the held-out trap set (37/37 caught, 0 false-rejects, ECE 0), a lone “100%” would read as fake. So the page leads with the contrast — the Gate catches 37/37 where a number-blind single judge catches 9/37 — and renders the calibration of a decisive verifier honestly (only 2 of 5 reliability bins populate; no curve is drawn through the empty middle).
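To make the sparse-bin behaviour concrete, the sketch below computes a five-bin reliability table and ECE for a toy verifier whose scores sit at 0 or 1. The function name `reliability_bins`, the equal-width binning, and the toy inputs are all assumptions for illustration; they are not the pipeline's data or code.

```python
def reliability_bins(scores, labels, n_bins=5):
    """Return (populated_bins, ece) using equal-width bins over [0, 1].

    scores: predicted probability that an item is a trap (hypothetical field).
    labels: 1 if the item really is a trap, else 0.
    Empty bins are skipped rather than interpolated.
    """
    bins = [{"n": 0, "score_sum": 0.0, "positives": 0} for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        idx = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx]["n"] += 1
        bins[idx]["score_sum"] += score
        bins[idx]["positives"] += label

    total = len(scores)
    populated, ece = [], 0.0
    for idx, b in enumerate(bins):
        if b["n"] == 0:
            continue  # nothing observed here, so nothing to plot
        mean_score = b["score_sum"] / b["n"]
        observed = b["positives"] / b["n"]
        ece += (b["n"] / total) * abs(observed - mean_score)
        populated.append({"bin": idx, "n": b["n"], "mean_score": mean_score, "observed": observed})
    return populated, ece


# Toy, decisive verifier: every score is exactly 0 or 1 and matches its label.
toy_scores = [0.0, 0.0, 1.0, 1.0, 1.0]
toy_labels = [0, 0, 1, 1, 1]
populated, ece = reliability_bins(toy_scores, toy_labels)
print(len(populated), "of 5 bins populated; ECE =", ece)
```

With scores this decisive only the first and last bins receive any items, so a chart built from `populated` has two points and no basis for a curve between them.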

The chart catalogue

Every chart maps to a real field:

Operating point	catch-rate vs false-reject, Gate vs single judge — from `gate`/`baseline` in assurance.json, with Wilson whiskers
Confusion pair	raw counts {caught, missed, false-reject, passed}: {37,0,0,10} for the Gate vs {9,28,0,10} for the single judge, derived from the 47 per-trap `items[]`
Lift bars	catch-rate by mutation family (number-drift, superlative, false-equivalence, unsayable) with real per-type denominators
Reliability + ECE	the 5 calibration bins + ECE, rendered for a decisive verifier
Trap heat-grid	per-claim × attack — where the single judge misses
Stability strip	7 seeded runs over time — flat by construction (decision R5)
Bandit + KPI tree	the real ThompsonBandit on synthetic traffic — fenced and labelled illustrative
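A confusion pair like the one in the catalogue can be tallied directly from per-item records. The sketch below assumes a hypothetical item shape (`is_trap`, `gate_rejected`, `judge_rejected`); the real schema of `items[]` in assurance.json is not documented here, and the toy rows are invented for the example.

```python
def confusion_counts(items, verdict_key):
    """Tally caught / missed / false_reject / passed for one verifier.

    items: list of dicts with hypothetical keys `is_trap` and `verdict_key`.
    """
    counts = {"caught": 0, "missed": 0, "false_reject": 0, "passed": 0}
    for item in items:
        rejected = item[verdict_key]
        if item["is_trap"]:
            counts["caught" if rejected else "missed"] += 1
        else:
            counts["false_reject" if rejected else "passed"] += 1
    return counts


# Toy records only: three traps and two clean controls.
toy_items = [
    {"is_trap": True, "gate_rejected": True, "judge_rejected": True},
    {"is_trap": True, "gate_rejected": True, "judge_rejected": False},
    {"is_trap": True, "gate_rejected": True, "judge_rejected": False},
    {"is_trap": False, "gate_rejected": False, "judge_rejected": False},
    {"is_trap": False, "gate_rejected": False, "judge_rejected": False},
]

gate = confusion_counts(toy_items, "gate_rejected")
judge = confusion_counts(toy_items, "judge_rejected")
print("Gate :", gate)
print("Judge:", judge)
```

Because both tallies come from the same list, the four counts for each verifier always sum to the number of items, which is a cheap sanity check to run before rendering.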

How it's verified — and kept aligned

The eval harness is immutable (CONSTITUTION Article VII) — all trust properties stay green in tests/; the dashboard adapts to the tests, never the reverse.
Every derived count (confusion, per-type fractions, CIs) is recomputed from items[] at request time, so a stale number is impossible.
Shared dashboard primitives (q-sec, q-sim, q-kpitree) live in quiet.css so /assurance and the KPI control tower render the same components.
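Recomputing per-type fractions at request time could look something like the sketch below. The field names (`is_trap`, `mutation`, `gate_rejected`) and the toy rows are assumptions for illustration, not the real `items[]` schema.

```python
from collections import defaultdict


def catch_rate_by_family(items, verdict_key):
    """Return {mutation_family: (caught, total)} over trap items only."""
    tally = defaultdict(lambda: [0, 0])  # family -> [caught, total]
    for item in items:
        if not item["is_trap"]:
            continue  # clean controls have no mutation family
        family = tally[item["mutation"]]
        family[0] += int(item[verdict_key])
        family[1] += 1
    return {name: (caught, total) for name, (caught, total) in tally.items()}


# Toy records only; family names are taken from the chart catalogue.
toy_items = [
    {"is_trap": True, "mutation": "number-drift", "gate_rejected": True},
    {"is_trap": True, "mutation": "number-drift", "gate_rejected": True},
    {"is_trap": True, "mutation": "superlative", "gate_rejected": True},
    {"is_trap": False, "mutation": None, "gate_rejected": False},
]

rates = catch_rate_by_family(toy_items, "gate_rejected")
for name, (caught, total) in rates.items():
    print(f"{name}: {caught}/{total}")
```

Returning the pair (caught, total) rather than a pre-divided rate keeps the denominator attached to every fraction the page displays.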

See it live: the Assurance dashboard. For the engine it audits, see The Assurance Lab.

The Assurance Lab: An adversarial wind tunnel that mutates approved claims into labeled traps and runs them through the real Gate versus a single number-blind judge, reporting decomposed reliability.
The five trust properties: P1 through P5 plus E1, the headline properties that must be provably true, each tied to the test file that proves it.
The Quiet Workspace design system: quiet.css is a single light, near-monochrome token set where colour appears only in product data; pages compose its components and never restyle them, and charts are dependency-free inline SVG.
How this was built: Provenance is built to the same bar it holds outbound copy to, with a Constitution, a locked PRD, deterministic offline replay, and an immutable pytest harness that is the authoritative gate.
