L1–L4 — Evaluation
Claim Ledger
Prove what the system is willing to say.
Breaks each answer into atomic claims and checks each one against the evidence. Supported claims go through; unsupported ones are held back. Every response carries a per-claim audit trail — not a score, but the evidence each claim was checked against and why it passed or failed.
Without this: you score responses as monoliths. One backed claim hides three fabricated ones. You cannot tell auditors which assertion failed or why.
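The per-claim audit trail described above can be sketched as a record per claim. This is a minimal illustration, not the product's actual schema — the class and field names (`ClaimAudit`, `evidence_ids`, `passed`, `reason`) are assumptions:

```python
from dataclasses import dataclass

# Hypothetical per-claim audit record; field names are illustrative,
# not the actual schema.
@dataclass
class ClaimAudit:
    claim_id: str
    text: str
    evidence_ids: list[str]  # evidence spans the claim was checked against
    score: float             # support score in [0, 1]
    passed: bool             # gate decision for this claim
    reason: str              # why it passed or was held back

def audit_response(claims: list[ClaimAudit]):
    """Split a scored response into released and held-back claims."""
    released = [c for c in claims if c.passed]
    held = [c for c in claims if not c.passed]
    return released, held
```

Because the verdict is stored per claim, one supported claim can no longer hide an unsupported one in the same response.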
How evaluation works
Decomposes every response into atomic claims, then runs each through a multi-layer evaluation pipeline. The differentiator is contrastive causal attribution — measuring whether evidence actually caused a claim, not just whether it appeared nearby.
- Claim extraction: splits output into atomic assertions with IDs and types (all tiers)
- L1 calibrated confidence: token logprobs → calibrated correctness probabilities (all tiers)
- L2 source entailment: each claim scored against evidence via embedding similarity + NLI (all tiers)
- L3 stability: multi-draw regeneration and semantic clustering for reproducibility (all tiers with deterministic sampling)
- L4 representation uncertainty: hidden-state probes for internal volatility not visible in token confidence (Refinery/Clean Room only)
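The pipeline above can be sketched end to end. This is a hedged toy version: the real layers use token logprobs, NLI models, multi-draw regeneration, and hidden-state probes, whereas here each layer is a stand-in function returning a score in [0, 1], and claim extraction is naive sentence splitting:

```python
# Minimal sketch of the multi-layer evaluation pipeline.
# All function names and the gating rule are assumptions for illustration.

def extract_claims(response: str) -> list[str]:
    # Stand-in for claim extraction: one atomic claim per sentence.
    return [s.strip() for s in response.split(".") if s.strip()]

def evaluate_claim(claim: str, evidence: str, layers: dict) -> dict:
    # Run every layer and keep the per-layer breakdown for the audit trail.
    scores = {name: layer(claim, evidence) for name, layer in layers.items()}
    return {"claim": claim, "scores": scores, "min_score": min(scores.values())}

def run_pipeline(response: str, evidence: str, layers: dict,
                 threshold: float = 0.5) -> list[dict]:
    results = []
    for claim in extract_claims(response):
        result = evaluate_claim(claim, evidence, layers)
        result["passed"] = result["min_score"] >= threshold  # gate on weakest layer
        results.append(result)
    return results
```

Gating on the weakest layer, rather than an average, is one plausible design: a claim that is confidently worded (high L1) but unsupported by evidence (low L2) should still be held back.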
Who this is for
Governance runtime
executes automatically on every governed request — decomposes, scores, and gates without manual review.
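The automatic gate can be sketched as a pure function over per-claim results. Again a hypothetical interface, not the runtime's API — the input is assumed to be a list of scored claims like the pipeline above produces:

```python
# Hedged sketch of the automatic gate: release supported claims,
# hold back the rest, and keep both lists for the audit trail.

def gate_response(claim_results: list[dict], threshold: float = 0.5) -> dict:
    released, held = [], []
    for result in claim_results:
        (released if result["score"] >= threshold else held).append(result)
    text = ". ".join(r["claim"] for r in released)
    return {"text": text, "released": released, "held": held}
```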
Compliance officers and auditors
consume per-claim audit trails. Each claim links to the evidence it was checked against, the score it received, and the gate decision it triggered.