
Assurance

Proves: one number you can take to the board — and the decomposed reliability behind it. Every figure is shown beside the single-judge baseline, with its denominator.

Trust score
100%
unweighted mean of four checks
provenance 100% (21 facts) · hold-in-copy 0 · ungated 0/3 selected · traps 37/37
Blends a structural signal (ungated-never-selected, from the illustrative bandit) with a statistical one (trap catch-rate) — not a single empirical measurement.
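
The trust score is described as an unweighted mean of four checks. A minimal sketch of that arithmetic, assuming each check is first reduced to a pass fraction in [0, 1]. The helper `fraction_ok` and the mapping of "hold-in-copy 0" to a score of 1.0 are illustrative assumptions, not the page's actual implementation:

```python
from statistics import mean


def fraction_ok(bad: int, total: int) -> float:
    """Share of items without a violation; an empty set counts as fully OK."""
    return 1.0 if total == 0 else (total - bad) / total


# Counts are taken from the card above; the key names are hypothetical.
checks = {
    "provenance": fraction_ok(bad=0, total=21),              # 21 facts, all sourced
    "hold_in_copy": 1.0,                                     # assumption: 0 held claims in copy scores 1.0
    "ungated_never_selected": fraction_ok(bad=0, total=3),   # 0/3 ungated arms selected
    "trap_catch_rate": 37 / 37,                              # 37/37 traps caught
}

# Unweighted mean: every check contributes equally, whatever its sample size.
trust_score = mean(checks.values())
print(f"{trust_score:.0%}")
```

Because the mean is unweighted, the 3-item structural check moves the headline as much as the 37-item trap check, which is why the card lists each denominator next to the blended figure.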
Trap catch-rate vs single judge
37/37
judge 9/37 (24.3%) · +75.7 pts / 4.11× · 95% CI [91–100%]
False-reject (clean)
0/10
0 clean blocked · CI upper ~27.8% — n=10, weakest evidence
Missed bad claims
0 vs 28
Gate vs single judge — of 37 bad claims
Calibration (ECE)
0.000
decisive verifier · correct-when-confident · 2 of 5 bins populated
Stability
1/1 runs
17/17 golden · 5/5 properties

Source: data/demo/runs/assurance.json · single seeded replay (decision R5, byte-identical) · n=37 traps / 10 clean.

Business KPIs · progress toward target Optimizing ⟳ Simulated RL · synthetic traffic

The board view: the trust number above, and the business KPIs the verified optimizer is moving toward target — it can only win with variants that already cleared the Gate. Simulated outcomes; the full RL loop runs on the live-demo control tower.

acquisition

Bounce rate 43.8% · 89% to target
base 58.0% ↓ target 42.0%
Hero CTR 4.4% · 8% to target
base 4.0% ↑ target 9.0%
Lead → application 20.1% · 81% to target
base 12.0% ↑ target 22.0%

activation

Dormant reactivation 7.8% · 20% to target
base 6.0% ↑ target 15.0%

revenue

Paid conversion 5.5% · 58% to target
base 3.5% ↑ target 7.0%
B2B demo requests 2.1 · 3% to target
base 2.0 ↑ target 6.0 per 1k
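
The "% to target" figures are consistent with a simple linear progress formula, (current - base) / (target - base). A sketch under that assumption follows; the function name is hypothetical. The same expression covers lower-is-better KPIs such as bounce rate, because both differences change sign together. The page displays rounded inputs, so recomputing from them can differ from the displayed percentage by a point.

```python
def progress_to_target(base: float, current: float, target: float) -> float:
    """Linear progress from base to target, in percent (assumed formula)."""
    return (current - base) / (target - base) * 100


# (base, current, target) as displayed on the page.
kpis = {
    "Bounce rate": (58.0, 43.8, 42.0),         # lower is better
    "Hero CTR": (4.0, 4.4, 9.0),
    "Lead → application": (12.0, 20.1, 22.0),
    "Dormant reactivation": (6.0, 7.8, 15.0),
}

for name, (base, current, target) in kpis.items():
    print(f"{name}: {progress_to_target(base, current, target):.1f}% to target")
```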
[Figure: Operating point, catch-rate at fixed false-reject. x-axis: false-reject % (n=10 clean); y-axis: catch-rate % (n=37 traps). Points: Gate, single judge, and ★ ideal (0% FR · 100% catch).]

Higher catch at the same 0% false-reject — apples-to-apples (P5). Two real points, no threshold sweep exists, so no curve is drawn between them. Whiskers are Wilson 95% CIs.
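
The Wilson 95% intervals quoted on this page can be reproduced from the raw counts. A minimal sketch using the standard Wilson score formula with z = 1.96; the function name is illustrative:

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


gate_catch = wilson_interval(37, 37)    # Gate catch-rate, 37/37 traps
judge_catch = wilson_interval(9, 37)    # single-judge catch-rate, 9/37 traps
false_reject = wilson_interval(0, 10)   # false-reject on 10 clean claims

print(f"Gate catch-rate lower bound: {gate_catch[0]:.1%}")
print(f"False-reject upper bound:    {false_reject[1]:.1%}")
```

The false-reject interval is wide because n=10: even with zero observed false rejects, the upper bound stays near 28%, which is why the page flags it as the weakest evidence.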

Confusion: Gate vs single judge

Gate (n=47) | blocked | passed
trap | 37 caught | 0 missed
clean | 0 false-rej | 10 correct

Single judge (n=47) | blocked | passed
trap | 9 caught | 28 missed
clean | 0 false-rej | 10 correct

Raw counts (sum 47). A single LLM judge let 28 of 37 bad claims through; the Gate let 0. The baseline's missed cell is the information — a lone Gate matrix would just be two zeros.
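
The headline comparison (+75.7 pts / 4.11×) follows directly from these raw counts. A short worked check; the variable and function names are illustrative:

```python
def catch_rate(caught: int, traps: int) -> float:
    """Fraction of bad claims that were blocked."""
    return caught / traps


TRAPS = 37
gate = catch_rate(37, TRAPS)    # Gate blocked every trap
judge = catch_rate(9, TRAPS)    # single judge blocked 9 of 37

point_gain = (gate - judge) * 100   # difference in percentage points
ratio = gate / judge                # multiplicative lift
missed_gate = TRAPS - 37
missed_judge = TRAPS - 9

print(f"judge {judge:.1%} · +{point_gain:.1f} pts / {ratio:.2f}x")
print(f"missed bad claims: {missed_gate} vs {missed_judge}")
```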

Catch-rate by attack type: Gate vs single judge

Attack type | Gate | Single judge
Number drift | 7/7 | 0/7
Unsupported superlative | 10/10 | 4/10
False equivalence | 10/10 | 4/10
True but unsayable | 10/10 | 1/10

The Gate clears every family; the story is the baseline's holes — a number-blind judge catches 0/7 number-drift traps. Per-type n=7–10, so decimals are coarse; the pattern is the durable claim.

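[Figure: Calibration (reliability) diagram. x-axis: confidence; y-axis: accuracy. Bin counts from lowest to highest confidence: n=37, n=0, n=0, n=0, n=10. ECE 0.000 · 2 of 5 bins populated.]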

Decisive verifier: confidence concentrates at 0/1, so only 2 of 5 bins populate and the mid-range is empty by design (not missing data). ECE 0.000 means correct-when-confident, not graded-probability calibration.
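
To see why a decisive verifier yields ECE 0.000 with only two populated bins, here is a minimal sketch of expected calibration error over five equal-width bins. It assumes confidence means the predicted probability that a claim is clean (0.0 for every trap, 1.0 for every clean claim); that reading matches the bin counts above but is not confirmed by the page.

```python
def expected_calibration_error(confidences, outcomes, n_bins=5):
    """Return (ECE, populated bin count) over equal-width confidence bins."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, outcome))

    ece = 0.0
    populated = 0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        populated += 1
        mean_conf = sum(c for c, _ in members) / len(members)
        mean_outcome = sum(o for _, o in members) / len(members)
        ece += len(members) / total * abs(mean_outcome - mean_conf)
    return ece, populated


# Assumed inputs: 37 traps scored 0.0 (outcome 0), 10 clean claims scored 1.0 (outcome 1).
confidences = [0.0] * 37 + [1.0] * 10
outcomes = [0] * 37 + [1] * 10
ece, populated = expected_calibration_error(confidences, outcomes)
print(f"ECE {ece:.3f} · {populated} of 5 bins")
```

With all mass at the extremes, the metric only confirms that extreme confidences were right. It says nothing about how the verifier would behave at 0.3 or 0.7.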

Same set (37 traps, 10 clean), same false-reject (0%) — the Gate catches 4.11× what a number-blind judge does, with no calibration line invented through data it doesn't have.
Stability over time · regression safety 1/1 runs green

2026-06-28 → 2026-06-28: zero regression. Deterministic seeded replay (R5): the series is flat by construction, so it shows harness non-regression from a single run, not independent samples.

Gate catch-rate: 100% · no change
Single-judge catch: 24.3% · no change
False-reject: 0% · no change
Calibration ECE: 0.000 · no change
Golden evals: 17/17 · no change
Properties P1–P5: 5/5 · no change
The five headline properties
P1: A Gate-blocked lie can never be selected
P2: Legal-hold claim blocked the instant the hold flips (rules_version)
P3: A drift event re-verifies exactly the affected claims
P4: Website renders only Gate-passed claims, with the same verdict on both channels
P5: Catch-rate beats the single-judge baseline at a fixed false-reject

5/5 green every run — proven in tests/ against the real Gate (no mocks).

Golden coverage: cases per pipeline step
Claims Library: 1/1
The Gate (claim verification): 7/7
The Enrichment Gate (fact verification): 5/5
Optimizer (bandit over verified arms): 1/1
Drift Monitor (surgical re-verify): 1/1
Assurance Lab (Gate vs single judge): 1/1
Website channel (same Gate, both channels): 1/1

Bar = case count (all pass, so pass-rate would be a wall of full bars). The Gate (7) and the Enrichment Gate (5) carry the load; the other 5 steps have only 1 golden case each. These are hand-authored regression cases, not exhaustive coverage. Thin coverage is the roadmap.

lifecycle ✓ end-to-end: Maria Chen, form → enrich → Gate → personalize. Full per-step graph logs at /observatory · /api/observe/golden.

Optimizer behaviour ⟳ Simulated RL · synthetic traffic

Simulated reinforcement learning: a deterministic seeded bandit (real ThompsonBandit; seed 1729, 3000 rounds) on synthetic traffic, with no live users. Rewards are illustrative latent CTRs, so these are simulated outcomes, not achieved results. The one measured fact here is the blocked arm: selected 0× in every scenario (P1, an invariant from the real ThompsonBandit, not a simulated CTR).

Bandit convergence — share of pulls on the eventual winner
[Figure: three panels. x-axis: rounds; y-axis: share of pulls on the winner, 0 to 1.]
New anonymous viewer: winner A1 · 89% · blocked 0×
Existing customer (email match): winner B2 · 58% · blocked 0×
Emailed lead who clicked: winner C1 · 80% · blocked 0×

Each panel: the winner's traffic share climbing as the bandit learns; the dashed grey line is the ungated arm, flat at 0 (never selectable).
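
The "flat at 0" line is structural rather than learned: an arm that never cleared the Gate is simply not in the candidate set. The sketch below illustrates that idea with a generic Beta-Bernoulli Thompson sampler. It is not the project's ThompsonBandit; the function name, arm names, and latent CTRs are hypothetical, and only the seed (1729) and round count (3000) come from the page.

```python
import random


def run_gated_thompson(latent_ctr, gate_passed, rounds=3000, seed=1729):
    """Thompson sampling restricted to arms that passed the Gate."""
    rng = random.Random(seed)
    # Structural guarantee: blocked arms are excluded before any sampling happens.
    eligible = [arm for arm in latent_ctr if gate_passed[arm]]
    alpha = {arm: 1.0 for arm in eligible}  # Beta prior: successes + 1
    beta = {arm: 1.0 for arm in eligible}   # Beta prior: failures + 1
    pulls = {arm: 0 for arm in latent_ctr}

    for _ in range(rounds):
        # Draw one posterior sample per eligible arm and play the best draw.
        arm = max(eligible, key=lambda a: rng.betavariate(alpha[a], beta[a]))
        pulls[arm] += 1
        if rng.random() < latent_ctr[arm]:  # synthetic click
            alpha[arm] += 1
        else:
            beta[arm] += 1
    return pulls


# Hypothetical arms: the ungated arm has the highest latent CTR but is never eligible.
latent_ctr = {"A1": 0.09, "A2": 0.05, "ungated": 0.20}
gate_passed = {"A1": True, "A2": True, "ungated": False}
pulls = run_gated_thompson(latent_ctr, gate_passed)
print(pulls)
```

Giving the ungated arm the best latent CTR makes the point: the sampler cannot be tempted by it, because it never draws a posterior sample for that arm.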

The bandit’s reward projects onto the Business KPIs shown at the top of this page — fill = progress to target. (Same KPI-tree component as the live-demo control tower.)

Drift watch illustrative source

6/6 TTL-governed facts fresh · 0 paused. On a source change, Drift re-verifies exactly the affected claims and pauses only the dependent variant (P3 — surgical, not whole-corpus).

Fact | Source | Used by | Status
reverse-IP → employer | Bought (reverse-IP) | B2B peer-logo hero | fresh
company industry | Bought (reverse-IP) | B2B peer-logo hero | fresh
modeled income | Broker append | Tailored-offer module | fresh
PDL seniority | Enrichment (PDL) | Prestige / peer hero | fresh
PDL industry | Enrichment (PDL) | Prestige / peer hero | fresh
PDL role | Enrichment (PDL) | Click-continuity | fresh
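
Surgical re-verification amounts to a dependency lookup: when a source changes, only the variants that consume facts from that source are touched. A minimal sketch using the rows of the drift-watch table; the data structure and function name are assumptions for illustration, not the Drift Monitor's actual interface:

```python
# (fact, source, dependent variant), copied from the drift-watch table.
FACTS = [
    ("reverse-IP → employer", "Bought (reverse-IP)", "B2B peer-logo hero"),
    ("company industry", "Bought (reverse-IP)", "B2B peer-logo hero"),
    ("modeled income", "Broker append", "Tailored-offer module"),
    ("PDL seniority", "Enrichment (PDL)", "Prestige / peer hero"),
    ("PDL industry", "Enrichment (PDL)", "Prestige / peer hero"),
    ("PDL role", "Enrichment (PDL)", "Click-continuity"),
]


def variants_to_reverify(changed_source: str) -> list[str]:
    """Variants whose facts come from the changed source; all others stay live."""
    return sorted({variant for _, source, variant in FACTS if source == changed_source})


print(variants_to_reverify("Enrichment (PDL)"))
print(variants_to_reverify("Broker append"))
```

A change to the PDL source would touch two variants and leave the B2B hero and the tailored-offer module untouched, which is the whole-corpus re-check that P3 avoids.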