Runtime AI Governance
What your current stack cannot prove.
Runtime AI governance is the doctrine that closes the gap between AI capability and AI defensibility. It rests on two boundary invariants — REBAC for authorized evidence access and ARBV for boundary resilience under adversarial pressure — bound into one signed per-decision artifact, the Claim Ledger entry. This page defines the doctrine, names the state-of-the-art enterprise AI stack (Azure OpenAI, Bedrock, Anthropic, Pinecone, LangChain, Lakera, Datadog), and itemizes the seven supporting evidence artifacts that none of those layers produces by default — explaining why a credible AI deployment in 2026 needs a governance layer designed for runtime, not retrofit.
You shipped the AI features — OpenAI or Bedrock for the model, Pinecone or Weaviate for retrieval, LangChain or LlamaIndex for orchestration, Bedrock Guardrails or Lakera for safety, Datadog or LangSmith for traces. The boxes are checked.
The controls that helped you ship AI are not the controls that let you defend AI decisions.
Most enterprise AI governance programs run on an implicit equation — governance = access control + guardrails + observability. Those three are necessary. They are not sufficient. The forcing functions on your horizon ask for a different kind of proof.
The doctrine, in three lines:
REBAC defines who may know. ARBV proves the boundary still holds. The Claim Ledger proves what happened.
REBAC is the access invariant: a per-decision proof that this caller was authorized to retrieve this evidence at this time. ARBV is the resilience invariant: continuous, signed evidence that the boundary holds under semantic, retrieval, model, and policy pressure. The Claim Ledger is the custody artifact: caller, evidence, prompt, model, claims, gates, and output bound into one signed record an outside party can verify. Underneath sit supporting invariants — provenance, scope, grounding, gate state, replay, tamper-evidence, waiver control — which together make the two boundary invariants enforceable.
One-question diagnostic. Pick a single AI-assisted decision from the last two weeks. Produce the signed record. If your team reconstructs it from Datadog, LangSmith, CloudTrail, and a few screenshots, you do not have runtime governance — you have forensic storytelling. The uncomfortable moment is not when the model hallucinates; it is when you open the four tools your governance program is built on, one by one, and realize none of them holds the artifact the buyer just asked for.
The canonical 2026 enterprise AI stack.
If you have shipped AI features in the last twelve months, you are running one of two recipes. Many large enterprises run both, on different products inside the same company.
The hosted-frontier recipe. Model: Azure OpenAI, AWS Bedrock, Anthropic API, Google Vertex AI, or OpenRouter. Retrieval: Pinecone, Weaviate, Qdrant, pgvector, Bedrock Knowledge Bases, Azure AI Search, Vertex AI Search. Orchestration: LangChain, LlamaIndex, Vercel AI SDK, Semantic Kernel. Guardrail: Bedrock Guardrails, Lakera, PromptArmor, Azure AI Content Safety, OpenAI Moderation, Llama Guard. Observability: LangSmith, LangFuse, Datadog LLM Observability, OpenTelemetry GenAI. Authorization: IAM at the gateway, RBAC at the application controller.
The embedded-copilot recipe. Microsoft 365 Copilot, Glean, Notion AI, GitHub Copilot for Business, Salesforce Einstein Copilot. Authorization deferred to the host platform — SharePoint ACLs, Drive scopes, source-system grants. Retrieval scope is the platform’s tenant index.
These are state of the art, not legacy. Both shipped this year. Neither is broken — they do exactly what their architectures permit. The page below is about what those architectures do not permit, by default, regardless of how cleanly the components are wired or how recently they were upgraded.
The boundary your stack secures, vs. the one the audit asks about.
The stack above tracks the packaging boundary — who called the model, from what IP, with which key, against which guardrail. That answers “was this request authorized to talk to the model?” — a real, important question.
The questions on subpoenas, audit walkthroughs, and renewal-blocking diligence asks are about the evidence boundary — what the model knew, what it claimed, what supported the claim, and why the answer was allowed to leave. Packaging logs do not contain those answers. The model itself does not record them. The two boundaries use different primitives and produce different artifacts: a perfect packaging boundary still emits ungrounded claims; a perfect evidence boundary still needs the packaging boundary to keep unauthorized callers out.
REBAC and ARBV operate inside the evidence boundary. RBAC and IAM operate at the packaging boundary. They are not substitutes.
Things that look like AI governance but aren’t.
Five sentences enterprise teams use to describe their governance posture. None is a substitute for a per-decision evidence record. Each is a useful operational practice in its own right.
“We use Bedrock Guardrails / Lakera / Azure AI Content Safety.” Content filters block PII, denied topics, and prompt-injection patterns. They do not evaluate output against the evidence boundary, do not produce per-claim grounding, and do not sign decisions. They catch abuse, not chain of custody.
“We have LLM observability (LangSmith, LangFuse, Datadog) plus CloudTrail.” Traces are unsigned and mutable by default. CloudTrail can validate the integrity of its own log files, but those files cover AWS API events, not AI decisions, and neither source is bound to the other as one artifact. They show what went in and out; they do not record per-claim grounding, gate state, or per-document authorization decisions.
“Our retrieval is Pinecone / Weaviate / pgvector with namespaces, plus IAM at the gateway.” Tenant separation and API-level identity are real, but neither is a per-document, per-caller authorization decision recorded against the specific evidence that landed in the prompt. Top-K is a relevance ranking, not a permission boundary. RBAC at the gateway is not REBAC at retrieval.
“We rerun the prompt to verify the output.” Reruns produce a fresh response from a non-deterministic model against an index that may have been rebuilt since the original call. Replay means same evidence pool, same retrieval result, same gate decisions, same output — verifiable against a signed record. A rerun is not replay.
“M365 Copilot / Glean — the platform handles governance.” Microsoft and Glean enforce tenant ACLs at retrieval, which is a real authorization boundary. They do not produce a per-decision signed record that ties caller, evidence, claim, and gate together. The fact that Copilot respected SharePoint permissions is not, on its own, the artifact your auditor is asking for.
If your governance program is built on these, you have operations, monitoring, and abuse prevention. None of those produces REBAC, ARBV, or a signed Claim Ledger.
Seven artifacts your stack does not produce by default.
The seven artifacts below are the supporting invariants — the machinery that makes REBAC, ARBV, and the Claim Ledger doctrine enforceable. Run this list against your own architecture. Each item names the invariant, the layer that would have to produce it, what that layer does instead, and what is missing. By default means without additional engineering beyond what the named systems ship with.
1. The scope invariant — which evidence was in scope at the moment of inference.
Layer: retrieval (Pinecone, Weaviate, Qdrant, pgvector, Bedrock Knowledge Bases, Azure AI Search, Vertex AI Search). Default behavior: returns the top-K vectors closest to the query embedding within the namespace — a relevance ranking. What is missing: a record of which documents the specific caller was permitted to retrieve at request time. Provenance is implicit in this invariant: scope cannot exist for chunks without known source, version, and admissibility state. Asked what was in scope for user X at 14:32 UTC, the honest answer is “the entire namespace.”
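One way to picture the scope invariant is a retrieval step that filters to the caller's permitted set before ranking, and writes down what it filtered. The sketch below is illustrative only: `Chunk`, `scoped_top_k`, the permitted-ID set, and the record fields are hypothetical names, not the API of any vector store named above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    source: str    # provenance: where the chunk came from
    version: str   # provenance: source version at ingestion
    score: float   # relevance score from the vector search (hypothetical)


def scoped_top_k(candidates, permitted_ids, caller, k, now=None):
    """Filter candidates to the caller's permitted set, then rank by relevance.

    Returns the ranked chunks plus a scope record describing what was
    in scope for this caller at request time.
    """
    now = now or datetime.now(timezone.utc)
    in_scope = [c for c in candidates if c.chunk_id in permitted_ids]
    ranked = sorted(in_scope, key=lambda c: c.score, reverse=True)[:k]
    scope_record = {
        "caller": caller,
        "evaluated_at": now.isoformat(),
        "in_scope": sorted(c.chunk_id for c in in_scope),
        "excluded": sorted(
            c.chunk_id for c in candidates if c.chunk_id not in permitted_ids
        ),
        "retrieved": [c.chunk_id for c in ranked],
    }
    return ranked, scope_record


candidates = [
    Chunk("doc-a#1", "policy.pdf", "v3", 0.91),
    Chunk("doc-b#4", "hr-notes.docx", "v1", 0.95),  # most relevant, not permitted
    Chunk("doc-c#2", "faq.md", "v7", 0.80),
]
ranked, record = scoped_top_k(candidates, {"doc-a#1", "doc-c#2"}, "user-x", k=2)
```

The highest-scoring chunk never reaches the prompt because permission is applied before relevance, and the record can answer "what was in scope for user X" with a list rather than "the entire namespace."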
2. The REBAC manifestation — whether the caller was authorized for that specific evidence.
Layer: orchestration plus retrieval. Default behavior: the controller decides whether to make the call, then trusts retrieval to stay in scope. Bedrock Knowledge Bases inherit IAM at the data-source level. M365 Copilot and Glean defer to SharePoint and source-system ACLs. What is missing: an explicit, signed claim of the form “user X was authorized to retrieve chunk Y because of grant Z, evaluated at time T.” The authorization is implicit; the artifact is not.
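The claim "user X was authorized to retrieve chunk Y because of grant Z, evaluated at time T" can be written as a small data structure. This is a minimal sketch under stated assumptions: the relationship tuples, the `member` and `viewer` relation names, and the one-hop group lookup are all hypothetical, and a real record would also be signed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple

Relation = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical relationship tuples for illustration.
GRANTS = {
    ("user-x", "member", "team-finance"),
    ("team-finance", "viewer", "doc-q3-forecast"),
}


@dataclass(frozen=True)
class AuthorizationClaim:
    caller: str                       # user X
    chunk_id: str                     # chunk Y
    grant_path: Tuple[Relation, ...]  # grant Z: the relations that justify access
    evaluated_at: str                 # time T


def authorize(caller, document, chunk_id, grants=GRANTS, now=None) -> Optional[AuthorizationClaim]:
    """Return an explicit claim if a grant path exists, otherwise None."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    direct = (caller, "viewer", document)
    if direct in grants:
        return AuthorizationClaim(caller, chunk_id, (direct,), stamp)
    # One hop through a group the caller belongs to.
    for subject, relation, group in sorted(grants):
        if subject == caller and relation == "member":
            via_group = (group, "viewer", document)
            if via_group in grants:
                path = ((subject, relation, group), via_group)
                return AuthorizationClaim(caller, chunk_id, path, stamp)
    return None


claim = authorize("user-x", "doc-q3-forecast", "doc-q3-forecast#7")
denied = authorize("user-y", "doc-q3-forecast", "doc-q3-forecast#7")
```

The point of the sketch is that the decision leaves an artifact: the grant path and evaluation time are recorded, not just a boolean returned.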
3. The source-pool invariant — that the model did not draw on training outside the sanctioned pool.
Layer: the model itself (GPT, Claude, Gemini, Llama, anything via OpenRouter). Default behavior: the model retains its training corpus and answers from a blend of retrieved evidence and parametric memory. What is missing: a per-claim verification that distinguishes evidence-grounded sentences from training-grounded sentences. Without it, “the model said it” and “the evidence said it” look identical in the output.
4. The grounding invariant — that each claim in the response was grounded in retrieved evidence.
Layer: orchestration plus observability (LangSmith, LangFuse). Default behavior: traces show prompt and response. Citation features (Anthropic Citations, OpenAI file-search annotations) surface what the model claims to have cited. What is missing: an independent, per-claim verification that the cited evidence actually supports the sentence, and a record of which sentences passed and which failed.
5. The gate invariant — what the policy gate decided per claim, and why.
Layer: the guardrail tier (Bedrock Guardrails, Lakera, PromptArmor, Azure AI Content Safety, Llama Guard). Default behavior: one block-or-allow decision per request, based on PII, denied topics, or prompt-injection patterns. What is missing: per-claim gate decisions with outcomes (ALLOW, PARTIAL, REQUIRES_SPEC, BLOCKED) and reasons traceable to the evidence pool.
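A per-claim gate decision differs from a per-request filter mainly in its shape: one outcome and one reason per claim, each pointing back at evidence. The sketch below assumes a deliberately simple rule, and the rule is an illustration, not the doctrine's definition. The four outcome names come from this page; their exact semantics are not specified here, so the mapping to ALLOW, PARTIAL, and BLOCKED is an assumption and REQUIRES_SPEC is declared but never emitted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class GateOutcome(Enum):
    ALLOW = "ALLOW"
    PARTIAL = "PARTIAL"
    REQUIRES_SPEC = "REQUIRES_SPEC"  # semantics not defined on this page; unused below
    BLOCKED = "BLOCKED"


@dataclass(frozen=True)
class GateDecision:
    claim: str
    outcome: GateOutcome
    reason: str
    evidence_ids: Tuple[str, ...]  # evidence the decision traces back to


def gate_claim(claim, supporting_ids, authorized_ids) -> GateDecision:
    """Hypothetical rule: a claim passes only on evidence in the caller's pool."""
    supporting = set(supporting_ids)
    usable = supporting & set(authorized_ids)
    if not usable:
        return GateDecision(claim, GateOutcome.BLOCKED,
                            "no authorized evidence supports this claim", ())
    if usable == supporting:
        return GateDecision(claim, GateOutcome.ALLOW,
                            "all supporting evidence is in the caller's pool",
                            tuple(sorted(usable)))
    return GateDecision(claim, GateOutcome.PARTIAL,
                        "some supporting evidence is outside the caller's pool",
                        tuple(sorted(usable)))


pool = {"c1", "c2", "c3"}
decisions = [
    gate_claim("Revenue grew 12%", ["c1", "c2"], pool),
    gate_claim("Headcount is flat", ["c1", "c9"], pool),
    gate_claim("The merger closes in May", [], pool),
]
```

Each response then carries a list of decisions rather than a single allow-or-block bit, which is what makes the gate state auditable per claim.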
6. The replay invariant — that a contested decision can be replayed deterministically.
Layer: the model API plus retrieval. Default behavior: most APIs accept a temperature parameter and some accept a seed, which narrows variance but does not guarantee identical output. What is missing: a signed record that freezes evidence pool, retrieval result, prompt composition, and gate decisions together, against a verifiable hash chain. Without all of those, you can re-run the question; you cannot reproduce the decision.
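The difference between a rerun and a replay can be made mechanical by freezing each component as a digest at decision time and comparing digests later. This is a minimal sketch, assuming JSON-serializable components and SHA-256 over a canonical encoding; `freeze_decision` and `diff_replay` are hypothetical helper names, and a real record would additionally be signed and chained.

```python
import hashlib
import json


def canonical_digest(obj) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def freeze_decision(evidence_pool, retrieval_result, prompt, gate_decisions, output):
    """Digest each component, then bind the component digests into one."""
    digests = {
        "evidence_pool": canonical_digest(sorted(evidence_pool)),
        "retrieval_result": canonical_digest(retrieval_result),
        "prompt": canonical_digest(prompt),
        "gate_decisions": canonical_digest(gate_decisions),
        "output": canonical_digest(output),
    }
    # The right-hand side is evaluated before the key is added,
    # so "decision" covers exactly the five component digests.
    digests["decision"] = canonical_digest(digests)
    return digests


def diff_replay(original, replayed):
    """Names of components whose digests no longer match the frozen record."""
    return sorted(k for k in original if original[k] != replayed.get(k))


original = freeze_decision(["c1", "c2"], ["c1"], "Q: ...", [["claim-1", "ALLOW"]], "A: ...")
faithful = freeze_decision(["c1", "c2"], ["c1"], "Q: ...", [["claim-1", "ALLOW"]], "A: ...")
# Same question after the index was rebuilt with an extra chunk.
drifted = freeze_decision(["c1", "c2", "c3"], ["c1"], "Q: ...", [["claim-1", "ALLOW"]], "A: ...")
```

Here `diff_replay(original, faithful)` is empty, while the drifted run reports which component moved, which turns "the answer is different" into a specific, checkable statement.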
7. The ledger invariant — chain of custody for question, evidence, answer, and policy state.
Layer: all of the above, bound together. Default behavior: each layer emits its own logs (CloudTrail, CloudWatch, Datadog, LangSmith) into its own store. What is missing: a single signed, tamper-evident, externally verifiable record per decision — caller, scope, retrieval result, prompt, model and version, per-claim grounding, gate decisions, output, hash chain. This is the Claim Ledger doctrine made concrete.
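A hash chain is what makes a sequence of per-decision records tamper-evident: each entry commits to the one before it. The sketch below uses only the Python standard library and is an assumption-laden illustration, not the Claim Ledger format. In particular, HMAC with a shared demo key stands in for the signature; a record meant for outside verification would use an asymmetric signature so the verifier never holds the signing key.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64
DEMO_KEY = b"demo-key-not-for-production"  # stand-in only; see the note above


def _entry_hash(prev_hash, decision) -> str:
    body = {"prev_hash": prev_hash, "decision": decision}
    payload = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def _sign(entry_hash, key) -> str:
    return hmac.new(key, entry_hash.encode("utf-8"), hashlib.sha256).hexdigest()


def append_entry(chain, decision, key=DEMO_KEY):
    """Append a decision record that commits to the previous entry's hash."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS
    entry_hash = _entry_hash(prev_hash, decision)
    entry = {
        "prev_hash": prev_hash,
        "decision": decision,
        "entry_hash": entry_hash,
        "signature": _sign(entry_hash, key),
    }
    chain.append(entry)
    return entry


def verify_chain(chain, key=DEMO_KEY) -> bool:
    """Recompute every hash and signature; any edit or reorder fails."""
    prev_hash = GENESIS
    for entry in chain:
        expected = _entry_hash(entry["prev_hash"], entry["decision"])
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != expected:
            return False
        if not hmac.compare_digest(entry["signature"], _sign(expected, key)):
            return False
        prev_hash = entry["entry_hash"]
    return True


chain = []
append_entry(chain, {"caller": "user-x", "model": "model-v1", "output": "A1"})
append_entry(chain, {"caller": "user-y", "model": "model-v1", "output": "A2"})
intact = verify_chain(chain)

chain[0]["decision"]["output"] = "edited after the fact"
tampered = verify_chain(chain)
```

Editing the first record changes its recomputed hash, so verification fails even though the stored hashes and signatures were left alone.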
Plus the waiver invariant.
Every governance program has exceptions. A program with no waiver registry has tacit exceptions — which means it has unrecorded boundary failures. Any exception to the boundary must be explicit, owned, expiring, and visible in the audit surface. This is the eighth supporting invariant; it is administrative, not artifact-shaped, but it is load-bearing.
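"Explicit, owned, expiring, and visible" maps naturally onto a small registry. The following is a sketch under assumptions: the `Waiver` fields and the `split_waivers` helper are hypothetical, and the only rules enforced are that a waiver must name an owner and that expired waivers stay visible instead of being dropped.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Waiver:
    waiver_id: str
    owner: str            # a named person or role, never empty
    invariant: str        # which invariant the exception applies to
    reason: str
    expires_at: datetime  # every waiver must expire

    def __post_init__(self):
        if not self.owner.strip():
            raise ValueError("a waiver must have an owner")


def split_waivers(registry, now):
    """Return (active, expired) so expired exceptions stay on the audit surface."""
    active = [w for w in registry if w.expires_at > now]
    expired = [w for w in registry if w.expires_at <= now]
    return active, expired


now = datetime(2026, 3, 1, tzinfo=timezone.utc)
registry = [
    Waiver("W-1", "head-of-risk", "grounding", "pilot with manual review",
           datetime(2026, 6, 30, tzinfo=timezone.utc)),
    Waiver("W-2", "platform-lead", "replay", "legacy index migration",
           datetime(2026, 1, 31, tzinfo=timezone.utc)),
]
active, expired = split_waivers(registry, now)
```

An expired waiver that is still being relied on shows up in the second list, which is the recorded form of what would otherwise be a tacit exception.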
What this looks like in three scenarios.
Three high-frequency moments at which the gap above becomes visible. Each is a tabletop your team can run this quarter without buying anything.
Procurement, this quarter.
The buyer asks: “Provide an example artifact demonstrating runtime evidence enforcement for one AI-assisted decision.” Your team opens: Datadog, then LangSmith, then the Pinecone admin console. Datadog shows uptime and latency. LangSmith shows a prompt and a response with no grounding map. Pinecone shows index health. Why it fails: none of those tools is a per-decision signed record. The questionnaire response reads “we operate AI within governed boundaries.” The buyer’s procurement counsel reads it twice. The deal slips a quarter.
Audit, next quarter.
The auditor asks for the per-decision evidence record on five sampled AI-assisted decisions. Your team opens: LangSmith traces and a screenshot of the retrieval pipeline. Asked how the traces are tamper-evident, your team produces CloudTrail. Why it fails: CloudTrail records AWS API activity and delivers its log files to S3. It does not sign individual decision records, and it does not bind retrieval, prompt, gate, and output to each other. The auditor writes the finding. Remediation goes on the next board pack.
Signed audit replay, within six months.
The customer asks for signed audit replay of a contested AI-assisted decision under the contract’s replay clause. Your team opens: the runbook. The runbook says rerun the prompt against the live index. Why it fails: the index has been rebuilt twice since the original call, the model provider’s internal execution cannot be replayed, and the output is different. The customer pauses the renewal pending an explanation. The contract was worth more than the governance budget would have been.
Some teams have partial controls — metadata filters at retrieval, custom auth wrappers, append-only audit logs, manual claim-citation review. Those controls matter. They are not equivalent to a signed evidence-boundary record unless they bind caller, evidence, claim, gate, and output into one artifact that an outside party can verify without trusting your operations team. The test is not whether the parts exist. The test is whether the artifact exists.
What governed AI actually looks like.
Four doctrines and one engine. Relationship-aware access defines who may know. Evidence attributes define what the evidence may do. ARBV proves the boundary still holds. The Claim Ledger proves what happened. The bounded reasoning runtime is what executes against all four at inference time.
The access invariant: an evidence boundary with provenance, relationships, and attributes.
A governed evidence store records, for every chunk of source material, what it covers, what it relates to, what it must never be used to answer, which callers are authorized to retrieve it, and which document attributes permit the use. Authorization is enforced at retrieval, not at the controller. The boundary is a runtime property, not a documentation claim.
The engine: a bounded reasoning runtime that retrieves only inside the boundary.
The model executes against a context that contains only the evidence the caller was authorized for, scoped to the question. Out-of-scope evidence is not in the retrieval pool. Out-of-scope claims fail grounding at the gate. The runtime enforces the same boundary the evidence store records.
The custody artifact: a Claim Ledger that signs each decision with a public-key-verifiable hash chain.
Every inference produces a signed record: caller, evidence pool, retrieved chunks, prompt composition, model and version, per-claim grounding, gate decisions, output state. The record is hash-chained and signed for third-party verification, so an outside party can inspect it without trusting the operator. The same record enables signed audit replay. It does not claim deterministic replay of a hosted provider’s internal inference environment.
The resilience invariant: an adversarial verification loop (ARBV).
The boundary is tested adversarially on a schedule, against the same threat models a regulator or counterparty will use. Failures produce signed Boundary Evidence Records that show where the system held — or where it gave way and what was changed. Without this loop, the boundary is a promise, not a property.
How Kenshiki Labs maps to that doctrine.
The mapping is diagnostic. Each gap on this page corresponds to one missing layer.
- If the failure is the access invariant (evidence scope or per-caller authorization), the missing layer is Kura. It replaces the slot Pinecone or Bedrock Knowledge Bases occupies, and adds the per-document, per-caller REBAC decision those systems do not record by default.
- If the failure is grounding or per-claim gating, the missing layer is Kadai, the bounded reasoning runtime. It sits in the slot LangChain or LlamaIndex occupies, and adds the per-claim grounding record and gate decision they do not produce.
- If the failure is the custody artifact (replay or chain of custody), the missing layer is the Claim Ledger. It replaces the role LangSmith, LangFuse, and Datadog play, and adds the hash-chained, public-key-verifiable per-decision artifact they do not produce.
- If the failure is the resilience invariant — “we cannot prove the boundary holds under attack” — the missing layer is ARBV. There is no slot for this in the canonical stack. It is new work.
Kenshiki Labs runs on three deployment tiers — Workshop (shared infrastructure), Refinery (your VPC), Clean Room (air-gapped). Governance semantics are identical across tiers. The artifacts they produce are identical too.
The doctrine, restated.
REBAC defines who may know. ARBV proves the boundary still holds. The Claim Ledger proves what happened.
If you cannot retrieve the per-decision record without reconstructing it, you do not have runtime AI governance yet. Until something writes those three artifacts, your AI program is a packaging boundary that hopes the evidence-boundary question never arrives. The forcing functions on your horizon are how the question arrives.
Related concepts
- See your regulatory horizon — When the forcing functions on this page actually arrive at your org
- ARBV — Adversarial Resilience and Boundary Verification — The boundary-resilience invariant — how the boundary is tested under pressure
- Claim Ledger — The signed per-decision artifact that binds caller, evidence, claim, gate, and output
- REBAC (relationship-based access control) — The access invariant — per-caller, per-document evidence authorization at retrieval
- Three-plane architecture — Where runtime governance sits in the larger Kenshiki Labs system