L1–L4 — Evaluation
Ledger
The signed per-decision record that turns each governed AI answer into an auditable, replayable artifact. One backed claim. One verified record.
The Claim Ledger is the per-decision evidence record at the core of the Kenshiki platform. For every governed AI response, the Ledger records the authorized evidence retrieved, the claims emitted by the model, the verification result for each claim, and the gating decision — signed and chainable. Without the Ledger, you score responses as monoliths: one backed claim hides three fabricated ones, and you cannot tell auditors which assertion failed or why.
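To make "signed and chainable" concrete, here is a minimal sketch of what such a record could look like. It is not the Kenshiki schema: the field names, the verdict and decision labels, the `GENESIS` value, and the use of HMAC-SHA256 as a stand-in for the real signature scheme are all assumptions made for illustration.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass

GENESIS = "0" * 64  # hypothetical prev_hash value for the first record in a chain


@dataclass(frozen=True)
class ClaimResult:
    claim_id: str
    text: str
    verdict: str          # hypothetical labels, e.g. "supported" / "unsupported"
    evidence_ids: tuple   # evidence the claim was checked against


@dataclass(frozen=True)
class LedgerRecord:
    request_id: str
    evidence_ids: tuple   # authorized evidence retrieved for the request
    claims: tuple         # ClaimResult entries, one per emitted claim
    gate_decision: str    # hypothetical labels, e.g. "emit" / "block"
    prev_hash: str        # hash of the previous record, which makes records chainable


def canonical_bytes(record: LedgerRecord) -> bytes:
    # Sorted keys and fixed separators give a stable byte string to hash and sign.
    return json.dumps(asdict(record), sort_keys=True, separators=(",", ":")).encode("utf-8")


def record_hash(record: LedgerRecord) -> str:
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


def sign(record: LedgerRecord, key: bytes) -> str:
    # HMAC is a symmetric stand-in for whatever signature scheme is actually used.
    return hmac.new(key, canonical_bytes(record), hashlib.sha256).hexdigest()


def verify_chain(entries, key: bytes) -> bool:
    """Check each (record, signature) pair and the hash link to its predecessor."""
    prev = GENESIS
    for record, signature in entries:
        if record.prev_hash != prev:
            return False
        if not hmac.compare_digest(sign(record, key), signature):
            return False
        prev = record_hash(record)
    return True


key = b"demo-key-not-for-production"
first = LedgerRecord(
    request_id="req-1",
    evidence_ids=("ev-1",),
    claims=(ClaimResult("c1", "Retention is 90 days.", "supported", ("ev-1",)),),
    gate_decision="emit",
    prev_hash=GENESIS,
)
second = LedgerRecord(
    request_id="req-2",
    evidence_ids=("ev-2",),
    claims=(ClaimResult("c1", "Backups run hourly.", "unsupported", ()),),
    gate_decision="block",
    prev_hash=record_hash(first),
)
chain = [(first, sign(first, key)), (second, sign(second, key))]
print(verify_chain(chain, key))
```

Because each record carries the hash of its predecessor, editing any earlier record invalidates both its own signature and every later link. That property is what lets an auditor replay a chain and trust it.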
How Ledger turns a model proposal into a gate-ready record
Read this left to right, starting from the Kadai proposal. A generated answer enters with the evidence that was in scope. Ledger breaks the answer into claims, checks those claims against the evidence, and emits a structured evaluation record for Gate. Ledger does not retrieve evidence, and it does not make the final emission decision.
How evaluation works
Ledger decomposes every response into atomic claims, then runs each claim through a multi-layer evaluation pipeline. The differentiator is contrastive causal attribution: measuring whether evidence actually caused a claim, not just whether it appeared nearby.
- Claim extraction: splits output into atomic assertions with IDs and types (all tiers)
- L1 confidence signals: token logprobs → calibration-aware support signals (all tiers)
- L2 source entailment: each claim scored against evidence via embedding similarity + NLI (all tiers)
- L3 stability: multi-draw regeneration and semantic clustering for reproducibility (all tiers with deterministic sampling)
- L4 representation uncertainty: hidden-state probes for internal volatility not visible in token confidence (Refinery/Clean Room only)
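The idea behind contrastive attribution can be illustrated with a leave-one-out comparison: score a claim against all in-scope evidence, then score it again with one passage removed, and treat the drop as that passage's contribution. The sketch below is one possible reading, not Ledger's method. The `support` function is a toy token-overlap scorer standing in for the embedding and NLI layers, and all names are hypothetical.

```python
import re


def tokens(text: str) -> set:
    # Lowercased alphanumeric tokens; a deliberately crude stand-in for real NLP.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support(claim: str, evidence: dict) -> float:
    """Toy support score: fraction of claim tokens found in any evidence passage."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    evidence_tokens = set()
    for passage in evidence.values():
        evidence_tokens |= tokens(passage)
    return len(claim_tokens & evidence_tokens) / len(claim_tokens)


def contrastive_attribution(claim: str, evidence: dict) -> dict:
    """Leave-one-out: how far does support drop when each passage is removed?"""
    full = support(claim, evidence)
    return {
        evidence_id: full - support(
            claim, {k: v for k, v in evidence.items() if k != evidence_id}
        )
        for evidence_id in evidence
    }


claim = "The retention period is 90 days."
evidence = {
    "e1": "Logs are kept for a retention period of 90 days.",
    "e2": "Backups run nightly.",  # in scope, but contributes nothing to the claim
}
attribution = contrastive_attribution(claim, evidence)
print(attribution)
```

Here `e2` was retrieved alongside `e1` but receives an attribution of zero, because removing it does not change the claim's support. A proximity-only check would not separate the two passages this way.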
Who this is for
- Governance runtime: executes automatically on every governed request, decomposing, scoring, and gating without manual review.
- Compliance officers and auditors: consume per-claim audit trails. Each claim links to the evidence it was checked against, the score it received, and the reason codes Gate later enforces.
Ledger, the integrity-protected inference audit trail, decomposes every model output into atomic claims, evaluates each against evidence from Kura (the governed evidence store), and records the evaluation chain. Gate, the emission policy boundary, reads the evaluation record to make deterministic emission decisions. Per-claim evaluation records survive across agent and runtime boundaries.
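A deterministic emission decision can be pictured as a pure function of the evaluation record: the same record always yields the same outcome. The following is a minimal sketch under that assumption. The `ClaimEvaluation` fields, the `min_support` threshold, and the "emit" / "block" labels are hypothetical and do not describe Gate's actual policy language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimEvaluation:
    claim_id: str
    support_score: float      # hypothetical 0..1 score from the evaluation layers
    reason_codes: tuple = ()  # hypothetical codes explaining a low score


def gate_decision(evaluations, min_support: float = 0.8) -> dict:
    """Pure function of the record: no randomness, clock, or network access."""
    if not evaluations:
        # Nothing was verified, so there is nothing that can safely be emitted.
        return {"decision": "block", "failing_claims": [], "reason_codes": ["NO_CLAIMS"]}
    failing = [e for e in evaluations if e.support_score < min_support]
    if failing:
        return {
            "decision": "block",
            "failing_claims": sorted(e.claim_id for e in failing),
            "reason_codes": sorted({code for e in failing for code in e.reason_codes}),
        }
    return {"decision": "emit", "failing_claims": [], "reason_codes": []}


record = [
    ClaimEvaluation("c1", 0.95),
    ClaimEvaluation("c2", 0.40, ("NO_ENTAILMENT",)),
]
decision = gate_decision(record)
print(decision)
```

Sorting the failing claim IDs and reason codes keeps the output stable regardless of input order, so a replayed record produces a byte-identical decision.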