Watched

Assurance

Proves: one number you can take to the board — and the decomposed reliability behind it. Every figure is shown beside the single-judge baseline, with its denominator.

Trust score

100%

unweighted mean of four checks

provenance 100% (21 facts) · hold-in-copy 0 · ungated 0/3 selected · traps 37/37

Blends a structural signal (ungated-never-selected, from the illustrative bandit) with a statistical one (trap catch-rate) — not a single empirical measurement.
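As a rough illustration of "unweighted mean of four checks", the sketch below reduces each check to a pass fraction and averages them. How the dashboard actually normalises each check (in particular, treating "hold-in-copy 0" as full marks) is an assumption, and the dictionary keys are hypothetical names.

```python
from statistics import mean

# Each check reduced to a pass fraction in [0, 1].
# The normalisation of each check is an assumption, not taken from the dashboard code.
checks = {
    "provenance": 21 / 21,                  # 100% of 21 facts have provenance
    "hold_in_copy": 1.0,                    # assumed: 0 held claims in copy scores full marks
    "ungated_never_selected": 1 - 0 / 3,    # 0 of 3 ungated arms were ever selected
    "trap_catch_rate": 37 / 37,             # 37 of 37 traps caught
}

# Unweighted mean: every check counts equally.
trust_score = mean(checks.values())
print(f"{trust_score:.0%}")  # 100%
```

Because the mean is unweighted, a structural invariant and a statistical rate move the score equally, which is why the page labels it a blend rather than a single measurement.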

Trap catch-rate vs single judge

37/37

judge 9/37 (24.3%) · +75.7 pts / 4.11× · 95% CI [91–100%]

False-reject (clean)

0/10

0 clean blocked · CI upper ~27.8% — n=10, weakest evidence

Missed bad claims

0 vs 28

Gate vs single judge — of 37 bad claims

Calibration (ECE)

0.000

decisive verifier · correct-when-confident · 2 of 5 bins populated

Stability

1/1 runs

17/17 golden · 5/5 properties

Source: data/demo/runs/assurance.json · single seeded replay (decision R5, byte-identical) · n=37 traps / 10 clean.

Business KPIs · progress toward target · Optimizing · ⟳ Simulated RL · synthetic traffic

The board view: the trust number above, and the business KPIs the verified optimizer is moving toward target — it can only win with variants that already cleared the Gate. Simulated outcomes; the full RL loop runs on the live-demo control tower.

acquisition

Bounce rate 43.8% · 89% to target

base 58.0% ↓ target 42.0%

Hero CTR 4.4% · 8% to target

base 4.0% ↑ target 9.0%

Lead → application 20.1% · 81% to target

base 12.0% ↑ target 22.0%

activation

Dormant reactivation 7.8% · 20% to target

base 6.0% ↑ target 15.0%

revenue

Paid conversion 5.5% · 58% to target

base 3.5% ↑ target 7.0%

B2B demo requests 2.1 · 3% to target

base 2.0 ↑ target 6.0 per 1k
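The "% to target" figures above appear to be the fraction of the base-to-target gap already closed. A minimal sketch under that assumption follows; the function name is hypothetical. The same formula works for metrics that should fall (bounce rate) and for metrics that should rise, because the signs cancel.

```python
def progress_to_target(base: float, target: float, current: float) -> float:
    """Fraction of the base-to-target gap closed so far, clamped to [0, 1]."""
    span = target - base
    if span == 0:
        raise ValueError("base and target must differ")
    return max(0.0, min(1.0, (current - base) / span))

# Values copied from the KPI cards above.
print(f"{progress_to_target(58.0, 42.0, 43.8):.0%}")  # bounce rate: 89%
print(f"{progress_to_target(4.0, 9.0, 4.4):.0%}")     # hero CTR: 8%
print(f"{progress_to_target(12.0, 22.0, 20.1):.0%}")  # lead → application: 81%
print(f"{progress_to_target(6.0, 15.0, 7.8):.0%}")    # dormant reactivation: 20%
```

Recomputing from the displayed values can differ from the card by a point. Paid conversion at 5.5% gives about 57% here against the 58% shown, presumably because the displayed current value is itself rounded.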

Overview

By mutation

By channel

Trap ledger

Operating point — catch-rate at fixed false-reject

Higher catch at the same 0% false-reject: an apples-to-apples comparison (P5). There are only two real points and no threshold sweep, so no curve is drawn between them. Whiskers are Wilson 95% CIs.
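For readers who want to reproduce the whiskers, here is a minimal sketch of the standard Wilson score interval. The function name is hypothetical and the dashboard's own implementation is not shown on this page, but the formula reproduces the interval bounds quoted above.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z defaults to 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)

lo, hi = wilson_interval(37, 37)
print(f"Gate catch-rate 37/37: [{lo:.1%}, {hi:.1%}]")   # [90.6%, 100.0%]

lo, hi = wilson_interval(0, 10)
print(f"False-reject 0/10: upper bound {hi:.1%}")        # 27.8%
```

The wide upper bound on 0/10 is why the clean set is flagged as the weakest evidence: zero observed false-rejects on ten samples still leaves room for a rate above one in four.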

Confusion — Gate vs single judge

Raw counts (sum 47). A single LLM judge let 28 of 37 bad claims through; the Gate let 0. The baseline's missed cell is the information — a lone Gate matrix would just be two zeros.
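The raw counts can be tallied as below. This is a sketch with hypothetical names; the per-item outcomes are reconstructed from the totals on this page (37 bad, 10 clean, judge catches 9), not read from assurance.json.

```python
from collections import Counter

def confusion(is_bad: list[bool], blocked: list[bool]) -> Counter:
    """Tally the four cells of a block/allow confusion matrix."""
    cells = Counter()
    for bad, was_blocked in zip(is_bad, blocked):
        if bad and was_blocked:
            cells["caught"] += 1          # bad claim blocked
        elif bad:
            cells["missed"] += 1          # bad claim let through
        elif was_blocked:
            cells["false_reject"] += 1    # clean claim blocked
        else:
            cells["allowed"] += 1         # clean claim let through
    return cells

is_bad = [True] * 37 + [False] * 10
gate_blocked = [True] * 37 + [False] * 10
judge_blocked = [True] * 9 + [False] * 28 + [False] * 10

gate = confusion(is_bad, gate_blocked)
judge = confusion(is_bad, judge_blocked)
print(gate["caught"], gate["missed"], gate["false_reject"], gate["allowed"])      # 37 0 0 10
print(judge["caught"], judge["missed"], judge["false_reject"], judge["allowed"])  # 9 28 0 10
```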

Catch-rate by attack type — Gate vs single judge

The Gate clears every family; the story is the baseline's holes — a number-blind judge catches 0/7 number-drift traps. Per-type n=7–10, so decimals are coarse; the pattern is the durable claim.

Calibration (reliability)

Decisive verifier: confidence concentrates at 0/1, so only 2 of 5 bins populate and the mid-range is empty by design (not missing data). ECE 0.000 means correct-when-confident, not graded-probability calibration.
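To make the "2 of 5 bins" point concrete, the sketch below computes a binned expected calibration error on hypothetical inputs: 37 bad claims scored at probability 0.0 of being sayable and 10 clean claims scored at 1.0. The actual confidences behind the dashboard's ECE are not shown here, so these inputs and the function name are assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=5):
    """Binned ECE: weighted mean of |observed rate - mean predicted prob| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece, populated = 0.0, 0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        populated += 1
        mean_p = sum(p for p, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_y - mean_p)
    return ece, populated

# Hypothetical decisive verifier: every prediction sits at 0 or 1 and is correct.
probs = [0.0] * 37 + [1.0] * 10
labels = [0] * 37 + [1] * 10
ece, populated = expected_calibration_error(probs, labels)
print(f"ECE {ece:.3f}, {populated} of 5 bins populated")  # ECE 0.000, 2 of 5 bins populated
```

With all mass at the two extremes, a zero ECE says nothing about how well intermediate probabilities would be calibrated, which matches the caveat above.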

Same set (37 traps, 10 clean), same false-reject (0%) — the Gate catches 4.11× what a number-blind judge does, with no calibration line invented through data it doesn't have.

Catch-rate by attack type — Gate vs single judge

Lift per family: number drift +100pp · unsupported superlative +60pp · false equivalence +60pp · true but unsayable +90pp.
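The per-family lift is just the difference in catch-rate in percentage points. A short sketch using the counts from the trap ledger (variable names are hypothetical):

```python
# family: (n traps, Gate catches, single-judge catches), counted from the trap ledger
families = {
    "number drift": (7, 7, 0),
    "unsupported superlative": (10, 10, 4),
    "false equivalence": (10, 10, 4),
    "true but unsayable": (10, 10, 1),
}

lift_pp = {
    name: 100 * (gate - judge) / n
    for name, (n, gate, judge) in families.items()
}
for name, lift in lift_pp.items():
    print(f"{name}: +{lift:.0f}pp")

# Overall figures on the headline card.
total_n = sum(n for n, _, _ in families.values())
total_gate = sum(g for _, g, _ in families.values())
total_judge = sum(j for _, _, j in families.values())
overall_pts = 100 * (total_gate - total_judge) / total_n
overall_ratio = total_gate / total_judge
print(f"overall: +{overall_pts:.1f} pts / {overall_ratio:.2f}×")  # +75.7 pts / 4.11×
```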

Single judge — per claim × attack (where it misses)

✓ caught · ✗ missed · hatched = no trap for that claim×family (number-drift has only 7 of 10). The Gate's grid is all-caught (37/37); shown here is the single judge, so its red cells are visible.

By severity — material vs puffery

material

Gate 27/27 · judge 5/27

puffery

Gate 10/10 · judge 4/10

Green bar = Gate; amber tick = single judge (derived from items[], not shipped in assurance.json). The Gate catches every material trap (number drift, false equivalence, true but unsayable); the single judge catches fewer than 1 in 5 (5/27).

Trap taxonomy

Family	n	What it mutates
number drift	7	altered figures (a number-blind judge structurally can't verify arithmetic)
unsupported superlative	10	puffery — an unbacked best/most/leading claim
false equivalence	10	a misleading comparison between unlike things
true but unsayable	10	accurate but compliance-barred (e.g. a claim under legal hold)
clean (held-out)	10	genuine sayable claims — must not be blocked (the false-reject set)

Self-authored adversarial suite, n=47 (37 traps + 10 clean). Widening it — especially the clean set — is the clearest next eval.

Catch-rate per channel

Zero divergence between channels — an invariant, not corroboration.

Per-channel slice

Channel	traps	clean	Gate catch	false-reject	judge catch
email	37	10	100.0%	0.0%	24.3%
website	37	10	100.0%	0.0%	24.3%

Every field identical across channels — proof the boundary is shared, not duplicated.

Trap	Family	Severity	Source claim	Kind	Verdict	Conf	Gate call	Judge score	Judge call
clean_c_deployed	clean	clean	c_deployed	clean	green	0.90	✓	1.00	✓
clean_c_ehr	clean	clean	c_ehr	clean	green	0.70	✓	0.60	✓
clean_c_encrypt	clean	clean	c_encrypt	clean	amber	0.57	✓	0.57	✓
clean_c_hipaa	clean	clean	c_hipaa	clean	green	0.85	✓	1.00	✓
clean_c_los	clean	clean	c_los	clean	amber	0.57	✓	0.57	✓
clean_c_nofee	clean	clean	c_nofee	clean	green	0.78	✓	0.75	✓
clean_c_price	clean	clean	c_price	clean	green	0.90	✓	1.00	✓
clean_c_soc2	clean	clean	c_soc2	clean	green	0.90	✓	1.00	✓
clean_c_speed	clean	clean	c_speed	clean	green	0.78	✓	0.75	✓
clean_c_tco	clean	clean	c_tco	clean	amber	0.50	✓	0.50	✓
fe_c_deployed	false equivalence	material	c_deployed	bad	red	0.02	✓	0.56	✗
fe_c_ehr	false equivalence	material	c_ehr	bad	red	0.02	✓	0.33	✓
fe_c_encrypt	false equivalence	material	c_encrypt	bad	red	0.02	✓	0.36	✓
fe_c_hipaa	false equivalence	material	c_hipaa	bad	red	0.02	✓	0.50	✗
fe_c_los	false equivalence	material	c_los	bad	red	0.02	✓	0.36	✓
fe_c_nofee	false equivalence	material	c_nofee	bad	red	0.02	✓	0.50	✗
fe_c_price	false equivalence	material	c_price	bad	red	0.02	✓	0.56	✗
fe_c_soc2	false equivalence	material	c_soc2	bad	red	0.02	✓	0.60	✗
fe_c_speed	false equivalence	material	c_speed	bad	red	0.02	✓	0.50	✗
fe_c_tco	false equivalence	material	c_tco	bad	red	0.02	✓	0.30	✓
num_c_deployed	number drift	material	c_deployed	bad	red	0.05	✓	1.00	✗
num_c_ehr	number drift	material	c_ehr	bad	red	0.05	✓	0.60	✗
num_c_los	number drift	material	c_los	bad	red	0.05	✓	0.57	✗
num_c_nofee	number drift	material	c_nofee	bad	red	0.05	✓	0.75	✗
num_c_price	number drift	material	c_price	bad	red	0.05	✓	1.00	✗
num_c_speed	number drift	material	c_speed	bad	red	0.05	✓	0.75	✗
num_c_tco	number drift	material	c_tco	bad	red	0.05	✓	0.50	✗
sup_c_deployed	unsupported superlative	puffery	c_deployed	bad	red	0.02	✓	0.56	✗
sup_c_ehr	unsupported superlative	puffery	c_ehr	bad	red	0.02	✓	0.33	✓
sup_c_encrypt	unsupported superlative	puffery	c_encrypt	bad	red	0.02	✓	0.36	✓
sup_c_hipaa	unsupported superlative	puffery	c_hipaa	bad	red	0.02	✓	0.50	✗
sup_c_los	unsupported superlative	puffery	c_los	bad	red	0.02	✓	0.36	✓
sup_c_nofee	unsupported superlative	puffery	c_nofee	bad	red	0.02	✓	0.50	✗
sup_c_price	unsupported superlative	puffery	c_price	bad	red	0.02	✓	0.56	✗
sup_c_soc2	unsupported superlative	puffery	c_soc2	bad	red	0.02	✓	0.60	✗
sup_c_speed	unsupported superlative	puffery	c_speed	bad	red	0.02	✓	0.58	✗
sup_c_tco	unsupported superlative	puffery	c_tco	bad	red	0.02	✓	0.30	✓
tbu_c_deployed	true but unsayable	material	c_deployed	bad	red	0.02	✓	0.83	✗
tbu_c_ehr	true but unsayable	material	c_ehr	bad	red	0.02	✓	0.50	✗
tbu_c_encrypt	true but unsayable	material	c_encrypt	bad	red	0.02	✓	0.50	✗
tbu_c_hipaa	true but unsayable	material	c_hipaa	bad	red	0.02	✓	0.80	✗
tbu_c_los	true but unsayable	material	c_los	bad	red	0.02	✓	0.50	✗
tbu_c_nofee	true but unsayable	material	c_nofee	bad	red	0.02	✓	0.67	✗
tbu_c_price	true but unsayable	material	c_price	bad	red	0.02	✓	0.83	✗
tbu_c_soc2	true but unsayable	material	c_soc2	bad	red	0.02	✓	0.86	✗
tbu_c_speed	true but unsayable	material	c_speed	bad	red	0.02	✓	0.67	✗
tbu_c_tco	true but unsayable	material	c_tco	bad	red	0.02	✓	0.43	✓

✓ = correct call (a trap caught, or a clean claim correctly allowed) · ✗ = wrong call. The single judge makes 28 wrong calls — all misses; the Gate makes 0. Highlighted rows are the lift: the Gate caught them, the judge missed. Clean rows are correctly never blocked (amber ≠ reject), so false-reject = 0.
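The judge calls in the ledger are consistent with a simple rule: the judge lets a claim through when its score is at least 0.5. That threshold is inferred from the rows above, not stated by the source, so treat the sketch below (with hypothetical names) as one reading of the table rather than the judge's actual logic.

```python
JUDGE_PASS_THRESHOLD = 0.5  # assumption: inferred from the ledger rows, not documented

def judge_call_correct(kind: str, judge_score: float,
                       threshold: float = JUDGE_PASS_THRESHOLD) -> bool:
    """True when the judge's allow/block decision matches the claim's kind."""
    judge_allows = judge_score >= threshold
    # A clean claim should be allowed; a bad claim should be blocked.
    return judge_allows if kind == "clean" else not judge_allows

# A few rows copied from the ledger: (trap, kind, judge score)
sample_rows = [
    ("clean_c_tco", "clean", 0.50),  # ledger: ✓
    ("fe_c_hipaa", "bad", 0.50),     # ledger: ✗
    ("tbu_c_tco", "bad", 0.43),      # ledger: ✓
    ("num_c_price", "bad", 1.00),    # ledger: ✗
]
for trap, kind, score in sample_rows:
    mark = "✓" if judge_call_correct(kind, score) else "✗"
    print(trap, mark)
```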

Stability over time · regression safety · 1/1 runs green

2026-06-28 → 2026-06-28, zero regression. Deterministic seeded replay (R5): flat by construction, so this demonstrates harness non-regression, not an independent sample.

The five headline properties

P1 · P2 · P3 · P4 · P5

P1	A Gate-blocked lie can never be selected
P2	Legal-hold claim blocked the instant the hold flips (rules_version)
P3	A drift event re-verifies exactly the affected claims
P4	Website renders only Gate-passed claims — same verdict on both channels
P5	Catch-rate beats the single-judge baseline at a fixed false-reject

5/5 green every run — proven in tests/ against the real Gate (no mocks).

Golden coverage — cases per pipeline step

Claims Library

1/1

The Gate — claim verification

7/7

The Enrichment Gate — fact verification

5/5

Optimizer — bandit over verified arms

1/1

Drift Monitor — surgical re-verify

1/1

Assurance Lab — Gate vs single judge

1/1

Website channel — same Gate, both channels

1/1

Bar = case count (all pass, so pass-rate would be a wall of full bars). The Gate (7) and the Enrichment Gate (5) carry the load; the other 5 steps have only 1 golden case each. These are hand-authored regression cases, not exhaustive coverage. Thin coverage is the roadmap.

lifecycle ✓ end to end: Maria Chen, form → enrich → Gate → personalize. Full per-step graph logs at /observatory · /api/observe/golden.

Optimizer behaviour · ⟳ Simulated RL · synthetic traffic

Simulated reinforcement learning: a deterministic seeded bandit (real ThompsonBandit; seed 1729, 3000 rounds) on synthetic traffic, with no live users. Rewards are illustrative latent CTRs. These are simulated outcomes, not achieved results. The one measured fact here is the blocked arm: selected 0× in every scenario (P1, an invariant from the real ThompsonBandit, not a simulated CTR).

Bandit convergence — share of pulls on the eventual winner

Each panel: the winner's traffic share climbing as the bandit learns; the dashed grey line is the ungated arm, flat at 0 (never selectable).
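The flat-at-zero line for the ungated arm follows from how arms are selected, not from its reward. The sketch below is a generic Beta-Bernoulli Thompson sampler, not the project's ThompsonBandit; the arm names and latent CTRs are made up, and only the seed and round count are taken from this page. Blocked arms are removed from the candidate set before sampling, so they cannot be pulled however attractive their latent reward.

```python
import random

def run_bandit(latent_ctr: dict[str, float], blocked: set[str],
               rounds: int = 3000, seed: int = 1729) -> dict[str, int]:
    """Thompson sampling over Bernoulli arms; blocked arms are never candidates."""
    rng = random.Random(seed)
    eligible = [arm for arm in latent_ctr if arm not in blocked]
    wins = {arm: 1 for arm in latent_ctr}    # Beta(1, 1) prior
    losses = {arm: 1 for arm in latent_ctr}
    pulls = {arm: 0 for arm in latent_ctr}

    for _ in range(rounds):
        # Sample a plausible CTR per eligible arm and play the best sample.
        arm = max(eligible, key=lambda a: rng.betavariate(wins[a], losses[a]))
        pulls[arm] += 1
        if rng.random() < latent_ctr[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls

# Hypothetical arms: the blocked one has the highest latent CTR on purpose.
latent_ctr = {"variant_a": 0.04, "variant_b": 0.09, "ungated_lie": 0.30}
pulls = run_bandit(latent_ctr, blocked={"ungated_lie"})
print(pulls["ungated_lie"])  # 0
```

Giving the blocked arm the best latent reward makes the point of P1: the exclusion is structural, so no amount of learning can route traffic to it.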

The bandit’s reward projects onto the Business KPIs shown at the top of this page — fill = progress to target. (Same KPI-tree component as the live-demo control tower.)

Drift watch · illustrative source

6/6 TTL-governed facts fresh · 0 paused. On a source change, Drift re-verifies exactly the affected claims and pauses only the dependent variant (P3 — surgical, not whole-corpus).

Fact	Source	Used by	Status
reverse-IP → employer	Bought (reverse-IP)	B2B peer-logo hero	fresh
company industry	Bought (reverse-IP)	B2B peer-logo hero	fresh
modeled income	Broker append	Tailored-offer module	fresh
PDL seniority	Enrichment (PDL)	Prestige / peer hero	fresh
PDL industry	Enrichment (PDL)	Prestige / peer hero	fresh
PDL role	Enrichment (PDL)	Click-continuity	fresh
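The surgical behaviour described for P3 amounts to a dependency lookup: given a changed source, find only the facts it feeds and the variants that use them. A minimal sketch, treating the "Used by" column above as the dependency map (an assumption) and using hypothetical function names:

```python
# (fact, source, used by), copied from the drift-watch table above
FACTS = [
    ("reverse-IP → employer", "Bought (reverse-IP)", "B2B peer-logo hero"),
    ("company industry", "Bought (reverse-IP)", "B2B peer-logo hero"),
    ("modeled income", "Broker append", "Tailored-offer module"),
    ("PDL seniority", "Enrichment (PDL)", "Prestige / peer hero"),
    ("PDL industry", "Enrichment (PDL)", "Prestige / peer hero"),
    ("PDL role", "Enrichment (PDL)", "Click-continuity"),
]

def affected_by(changed_source: str) -> tuple[list[str], list[str]]:
    """Facts to re-verify and variants to pause when one source changes."""
    facts = [fact for fact, source, _ in FACTS if source == changed_source]
    variants = sorted({used_by for _, source, used_by in FACTS
                       if source == changed_source})
    return facts, variants

facts, variants = affected_by("Broker append")
print(facts)     # ['modeled income']
print(variants)  # ['Tailored-offer module']
```

A change to the broker append touches one fact and one variant; the other five facts and their variants stay live, which is the "not whole-corpus" property.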