Kenshiki Labs

L1–L4 — Evaluation

Ledger

The signed per-decision record that turns each governed AI answer into an auditable, replayable artifact. One backed claim. One verified record.

The Claim Ledger is the per-decision evidence record at the core of the Kenshiki platform. For every governed AI response, the Ledger records the authorized evidence retrieved, the claims emitted by the model, the verification result for each claim, and the gating decision — signed and chainable. Without the Ledger, you score responses as monoliths: one backed claim hides three fabricated ones, and you cannot tell auditors which assertion failed or why.
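As a rough illustration of what "signed and chainable" could look like, here is a minimal sketch. The field names, the `GENESIS` constant, the verdict labels, and the use of HMAC-SHA256 as the signature are all assumptions made for this example; the page does not specify the actual record schema or signing scheme.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass

GENESIS = "0" * 64  # hypothetical prev_hash for the first record in a chain


@dataclass
class ClaimResult:
    claim_id: str
    text: str
    evidence_ids: list   # evidence the claim was checked against
    verdict: str         # hypothetical labels, e.g. "backed" / "unbacked"


@dataclass
class LedgerRecord:
    response_id: str
    evidence_ids: list   # authorized evidence retrieved for this response
    claims: list         # one ClaimResult per emitted claim
    gate_decision: str
    prev_hash: str       # hash of the previous record, which makes records chainable

    def canonical_bytes(self) -> bytes:
        # Sorted keys and fixed separators give a stable byte encoding to hash and sign.
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":")).encode()

    def record_hash(self) -> str:
        return hashlib.sha256(self.canonical_bytes()).hexdigest()


def sign(record: LedgerRecord, key: bytes) -> str:
    # HMAC is a stand-in here; a production system might use asymmetric signatures.
    return hmac.new(key, record.canonical_bytes(), hashlib.sha256).hexdigest()


def verify_chain(records, signatures, key: bytes) -> bool:
    """Check every signature and every prev_hash link, in order."""
    prev = GENESIS
    for record, signature in zip(records, signatures):
        if record.prev_hash != prev:
            return False
        if not hmac.compare_digest(sign(record, key), signature):
            return False
        prev = record.record_hash()
    return True
```

Because each record carries the hash of its predecessor, changing one claim's verdict after the fact invalidates that record's signature and breaks the link to every later record.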

How Ledger turns a model proposal into a gate-ready record

Read this left to right, starting from the Kadai proposal. A generated answer enters with the evidence that was in scope. Ledger breaks the answer into claims, checks those claims against the evidence, and emits a structured evaluation record for Gate. Ledger does not retrieve evidence, and it does not make the final emission decision.

Ledger Claim Verification Lifecycle
A model proposal enters with in-scope evidence, then Ledger decomposes, compares, scores, and records each claim before Gate decides whether anything may be emitted.
Step 1 of 4: Receive
Step 2 of 4: Decompose
Step 3 of 4: Compare
Step 4 of 4: Record
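The four steps can be sketched as a small pipeline. This is a toy under stated assumptions: the sentence splitter stands in for claim extraction, token overlap stands in for the real comparison, and the `Proposal` type, `run_ledger` function, field names, and threshold are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Proposal:
    # Step 1 (Receive): the answer arrives together with its in-scope evidence.
    answer: str
    evidence: dict  # evidence_id -> passage text


def decompose(answer: str) -> list:
    # Step 2 (Decompose): naive sentence split as a placeholder for claim extraction.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def compare(claim: str, evidence: dict) -> tuple:
    # Step 3 (Compare): fraction of claim tokens found in the best-matching passage.
    claim_tokens = _tokens(claim)
    best_id: Optional[str] = None
    best_score = 0.0
    for evidence_id, passage in evidence.items():
        score = len(claim_tokens & _tokens(passage)) / max(len(claim_tokens), 1)
        if score > best_score:
            best_id, best_score = evidence_id, score
    return best_id, best_score


def run_ledger(proposal: Proposal, threshold: float = 0.8) -> list:
    # Step 4 (Record): one entry per claim; no emission decision is made here.
    record = []
    for i, claim in enumerate(decompose(proposal.answer), start=1):
        evidence_id, score = compare(claim, proposal.evidence)
        record.append({
            "claim_id": f"c{i}",
            "text": claim,
            "evidence_id": evidence_id,
            "score": round(score, 2),
            "supported": score >= threshold,
        })
    return record
```

The function returns a per-claim record and stops there, mirroring the split described above: Ledger evaluates, and Gate decides.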

How evaluation works

Ledger decomposes every response into atomic claims, then runs each claim through a multi-layer evaluation pipeline. The differentiator is contrastive causal attribution: measuring whether evidence actually caused a claim, not just whether it appeared nearby.
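One common way to measure whether evidence caused a claim is to score the claim with the evidence in context and again with that evidence removed. The sketch below assumes this ablation-style formulation; the page does not describe the actual method. `ScoreFn`, `contrastive_attribution`, and `toy_score` are hypothetical names, and `toy_score` is a stand-in for a model's log-probability of the claim given a context.

```python
import math
import re
from typing import Callable

# Hypothetical scorer: (context, claim) -> log-probability-like score.
ScoreFn = Callable[[str, str], float]


def contrastive_attribution(claim: str, evidence: dict, score: ScoreFn) -> dict:
    """Per-passage effect: score with all evidence minus score with that passage removed."""
    with_all = score("\n".join(evidence.values()), claim)
    effects = {}
    for evidence_id in evidence:
        ablated = "\n".join(
            text for other_id, text in evidence.items() if other_id != evidence_id
        )
        effects[evidence_id] = with_all - score(ablated, claim)
    return effects


def toy_score(context: str, claim: str) -> float:
    # Stand-in for a model call: log of the smoothed share of claim tokens in context.
    context_tokens = set(re.findall(r"[a-z0-9]+", context.lower()))
    claim_tokens = re.findall(r"[a-z0-9]+", claim.lower())
    hits = sum(token in context_tokens for token in claim_tokens)
    return math.log((hits + 1) / (len(claim_tokens) + 1))
```

A passage that merely sat in the context window gets an effect near zero, because removing it does not change the claim's score. A passage the claim depends on gets a large positive effect.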

  • Claim extraction: splits output into atomic assertions with IDs and types (all tiers)
  • L1 confidence signals: token logprobs → calibration-aware support signals (all tiers)
  • L2 source entailment: each claim scored against evidence via embedding similarity + NLI (all tiers)
  • L3 stability: multi-draw regeneration and semantic clustering for reproducibility (all tiers with deterministic sampling)
  • L4 representation uncertainty: hidden-state probes for internal volatility not visible in token confidence (Refinery/Clean Room only)
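To make the L3 stability idea concrete, the sketch below regenerates nothing itself: it takes a set of already-sampled draws, groups them, and reports the share that fall in the largest group. The `stability` and `toy_cluster_key` names are hypothetical, and the bag-of-tokens key is only a placeholder for real semantic clustering (for example, embedding similarity or entailment checks).

```python
import re
from collections import Counter
from typing import Callable, Hashable, Sequence


def stability(draws: Sequence[str], cluster_key: Callable[[str], Hashable]) -> float:
    """Fraction of draws that land in the largest cluster (1.0 means fully stable)."""
    if not draws:
        raise ValueError("at least one draw is required")
    counts = Counter(cluster_key(draw) for draw in draws)
    largest_cluster_size = counts.most_common(1)[0][1]
    return largest_cluster_size / len(draws)


def toy_cluster_key(text: str) -> frozenset:
    # Placeholder for semantic clustering: two draws match if they use the same words.
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))
```

For instance, three draws of which two say the same thing in a different word order would score two thirds under this toy key.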

Who this is for

Governance runtime

executes automatically on every governed request — decomposes, scores, and gates without manual review.

Compliance officers and auditors

consume per-claim audit trails. Each claim links to the evidence it was checked against, the score it received, and the reason codes Gate later enforces.

Ledger, the integrity-protected inference audit trail, decomposes every model output into atomic claims, evaluates each against evidence from Kura (the governed evidence store), and records the evaluation chain. Gate, the emission policy boundary, reads the evaluation record to make deterministic emission decisions. Per-claim evaluation records survive across agent and runtime boundaries.
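A deterministic emission decision can be pictured as a pure function of the evaluation record: the same record always yields the same decision and the same reason codes. The sketch below is illustrative only. The `ClaimEvaluation` fields, the thresholds, and reason codes such as `LOW_ENTAILMENT` are assumptions, not Gate's actual policy.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ClaimEvaluation:
    claim_id: str
    entailment: float  # hypothetical L2-style support score in [0, 1]
    stability: float   # hypothetical L3-style agreement score in [0, 1]


@dataclass(frozen=True)
class GateDecision:
    emit: bool
    reason_codes: tuple


def gate(
    evaluations: Iterable[ClaimEvaluation],
    min_entailment: float = 0.8,
    min_stability: float = 0.7,
) -> GateDecision:
    """Pure function of the record: no randomness, no model calls, no clock."""
    reasons = []
    # Sorting by claim ID keeps the output independent of input order.
    for evaluation in sorted(evaluations, key=lambda e: e.claim_id):
        if evaluation.entailment < min_entailment:
            reasons.append(f"{evaluation.claim_id}:LOW_ENTAILMENT")
        if evaluation.stability < min_stability:
            reasons.append(f"{evaluation.claim_id}:UNSTABLE")
    return GateDecision(emit=not reasons, reason_codes=tuple(reasons))
```

Because the decision depends only on the recorded scores, an auditor holding the same record can replay it and get the same result.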