Kenshiki Labs

Evidence store

Kura

Stores what counts as real. SIRE-tagged evidence corpus with deterministic identity, retrieval boundaries, and provenance for every chunk.

Kura is the write side of the evidence boundary: the corpus where authoritative source material is codified, indexed, provenance-stamped, and locked down before any AI ever sees it. Every document has SIRE identity (Subject, Included, Relevant, Excluded), every chunk has stable content-addressed identity, and every retrieval scope is computed deterministically against principal authorization.

Without Kura, every downstream decision is an assertion without evidence. The Compiler cannot scope what the model sees. The Ledger has nothing to check claims against. The Gate has no basis. No evidence, no grounded answer.

How Kura becomes governed model context

Read this left to right. Source material enters the evidence boundary, becomes policy-bearing chunks, and only then becomes bounded retrieval context. Kura stops at that handoff. It does not generate the final answer or verify claims; it makes governed evidence and identity available for downstream orchestration.

Kura Evidence Lifecycle
Source material becomes governed retrieval context before the model sees anything.
Steps: Source → Extract → Enrich → Store → Retrieve

Why Kura Exists

Standard RAG retrieves whatever is nearest in embedding space and hands it to the model. No authority boundary, no provenance, no access control, no way to inspect what the model was allowed to see. Governed inference requires a governed evidence boundary.

  • RAG without authority boundaries is retrieval, not governance
  • The model must not see evidence the caller cannot access
  • Post-generation scoring cannot fix what was never in scope
  • Every claim in the Ledger traces back to a specific chunk with provenance

What Kura Does

Transforms source documents into a queryable, tamper-evident knowledge base. Every chunk carries provenance from upload through embedding.

  • SHA-256 source hash, idempotent upsert, version-aware change detection
  • Section-aware chunking on heading boundaries with merge for undersized chunks
  • HMAC-SHA-256 watermarks per chunk — verification without database access
  • Embedding via text-embedding-3-large (512d Matryoshka)
  • Tenant provenance on every row, enforced by CHECK constraints
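
The hashing and watermarking bullets above can be sketched with the Python standard library. This is a minimal illustration, not Kura's implementation: the function names, the in-memory store, and the watermark message layout (tenant id, a zero byte, then chunk text) are assumptions made for the example.

```python
import hashlib
import hmac


def source_hash(data: bytes) -> str:
    """SHA-256 hex digest of the raw source bytes."""
    return hashlib.sha256(data).hexdigest()


def upsert_source(store: dict, doc_id: str, data: bytes) -> str:
    """Idempotent upsert keyed by document id (hypothetical in-memory store).

    Returns "created", "unchanged", or "updated" so a caller can skip
    re-chunking when the content hash has not changed.
    """
    digest = source_hash(data)
    previous = store.get(doc_id)
    if previous == digest:
        return "unchanged"
    store[doc_id] = digest
    return "created" if previous is None else "updated"


def watermark(key: bytes, tenant_id: str, chunk_text: str) -> str:
    """HMAC-SHA-256 over tenant id and chunk text (assumed message layout)."""
    message = tenant_id.encode("utf-8") + b"\x00" + chunk_text.encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_watermark(key: bytes, tenant_id: str, chunk_text: str, tag: str) -> bool:
    """Recompute the tag and compare in constant time.

    Only the key and the chunk are needed, so no database lookup is involved.
    """
    return hmac.compare_digest(watermark(key, tenant_id, chunk_text), tag)
```

Because verification only recomputes the HMAC, anyone holding the key can check a chunk offline. Re-uploading identical bytes returns "unchanged", which is what makes the upsert idempotent in this sketch.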

Pre-loaded Regulatory Corpus

Every Kenshiki Labs environment ships with a governed evidence base covering major AI governance standards, compliance frameworks, and industry-specific regulatory guidance — 2,200+ chunks, pre-tagged with SIRE identity and relationship mappings. Each framework is mapped through the Ontic Compliance Catalog to enforceable obligations, so the SIRE gate knows which evidence must exist before a governed request can proceed. Governed inference works on day one. Add your own documents on top.

  • EU AI Act (Regulation 2024/1689)
  • EU GDPR
  • HIPAA Administrative Simplification
  • PCI DSS 4.0.1
  • ISO/IEC 27001:2022 — Information Security
  • ISO/IEC 42001:2023 — AI Management System
  • ISO/IEC 23894:2023 — AI Risk Management
  • NIST AI Risk Management Framework 1.0
  • NIST AI 600-1 — Generative AI Profile
  • NIST Cybersecurity Framework 2.0
  • AICPA Trust Services Criteria (SOC 2)
  • DOJ Evaluation of Corporate Compliance Programs
  • 28 industry verticals: Financial Services, Healthcare, Defense & Intelligence, Government, Legal, Energy, Life Sciences, Education, and more
  • Ontic Compliance Catalog maps each framework to abstract obligations — the SIRE gate enforces them before retrieval

How Documents Are Parsed

Docling runs GPU-accelerated layout analysis (DocLayNet), table extraction (TableFormer), and OCR (EasyOCR). Output is enriched before chunking — not after, and not by the model.

  • Two-stage pipeline: GPU parse → CPU enrichment
  • Clause ID extraction for regulatory citations (e.g., "DFARS 252.204-7012")
  • Normative language detection (SHALL/MUST/REQUIRED flags)
  • Cross-reference resolution between sections and documents
  • Quality gate rejects OCR garbage, TOC entries, and low-density chunks
  • SIRE identity tags stamped during enrichment
  • Failed parses after 3 retries: quarantined, not dropped. Ledger receives a DEGRADED_BOUNDARY annotation.
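
A rough sketch of the CPU enrichment stage described above: clause ID extraction, normative flag detection, and a quality gate. The regular expressions, thresholds, and type names are assumptions chosen for illustration; Kura's actual rules are not documented here.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Assumed pattern: matches citations shaped like "DFARS 252.204-7012".
CLAUSE_ID = re.compile(r"\b(?:DFARS|FAR)\s+\d{1,3}\.\d{3}-\d{1,4}\b")
# In this sketch only uppercase keywords count as normative markers.
NORMATIVE = re.compile(r"\b(SHALL|MUST|REQUIRED)\b")


@dataclass
class Enriched:
    text: str
    clause_ids: list = field(default_factory=list)
    normative_flags: set = field(default_factory=set)


def passes_quality_gate(text: str, min_alpha_ratio: float = 0.6, min_words: int = 8) -> bool:
    """Reject very short or low-density text (thresholds are assumptions)."""
    words = text.split()
    if len(words) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    non_space = sum(not ch.isspace() for ch in text)
    return alpha / non_space >= min_alpha_ratio


def enrich(text: str) -> Optional[Enriched]:
    """Return enriched metadata, or None when the quality gate rejects the text."""
    if not passes_quality_gate(text):
        return None
    return Enriched(text, CLAUSE_ID.findall(text), set(NORMATIVE.findall(text)))
```

A table-of-contents line made mostly of dots and page numbers fails the alphabetic-density check and is rejected before it can become a chunk.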

SIRE Identity System

SIRE (Subject, Included, Relevant, Excluded) is deterministic identity metadata embedded in source frontmatter during ingestion. It defines what each source covers, relates to, and must never answer. Only Excluded enforces — the other three inform discovery.

  • Subject: anchors the source to a domain (e.g., soc_2_trust_services_criteria, eu_ai_act)
  • Included: enriches search with covered terminology (e.g., 'conformity assessment', 'cardholder data')
  • Relevant: maps cross-source topology (e.g., ISO 27001 → SOC 2; NIST AI RMF → EU AI Act)
  • Excluded: hard boundary (e.g., SOC 2 excludes 'sox', 'gaap', 'hipaa')
  • Exclusion gate purges matching chunks at retrieval — case-insensitive, word-boundary match
  • SIRE proposals generated by keyword frequency scan, then manually curated before application
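
The exclusion gate can be sketched as follows. One assumption is made loudly here: the source does not say what the Excluded terms are matched against, so this sketch matches them against the query text and drops chunks from any source whose exclusions hit. The `Sire` dataclass and function names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Sire:
    subject: str
    included: tuple = ()
    relevant: tuple = ()
    excluded: tuple = ()


def is_excluded(sire: Sire, query: str) -> bool:
    """True when any Excluded term appears in the query as a whole word.

    The match is case-insensitive and anchored on word boundaries, so
    "sox" matches "SOX" but not "soxhlet". Matching against the query
    (rather than chunk text) is an assumption of this sketch.
    """
    for term in sire.excluded:
        if re.search(rf"\b{re.escape(term)}\b", query, flags=re.IGNORECASE):
            return True
    return False


def apply_exclusion_gate(query: str, chunks: list) -> list:
    """Keep only (sire, text) pairs whose source does not exclude the query."""
    return [(sire, text) for sire, text in chunks if not is_excluded(sire, query)]
```

Only `excluded` participates in the gate; `included` and `relevant` are carried along for discovery, in line with "Only Excluded enforces".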

Retrieval and Access Control

At retrieval time, Kura applies the caller's access boundary to candidate evidence. The model only sees what the caller is authorized to use for this specific question. Tenant-scoped enforcement is live today; caller-specific OpenFGA/ReBAC retrieval enforcement is the next boundary.

  • Hybrid retrieval (pgvector + tsvector) ranked by semantic + lexical similarity
  • Chunks grouped by SIRE subject, ranked by mean relevance
  • SIRE exclusion gate purges out-of-scope chunks before they reach the model
  • Caller-specific evidence scoping via OpenFGA/ReBAC is the next retrieval boundary
  • Tenant-scoped row-level security on every evidence table
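
As an illustration of the ranking bullets above, the sketch below filters candidates by tenant, blends semantic and lexical scores, groups by SIRE subject, and orders groups by mean relevance. The equal-weight blend (`alpha = 0.5`), the `Candidate` type, and the in-memory filtering are assumptions; in Kura the tenant boundary is enforced by row-level security in the database, not in application code.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class Candidate:
    tenant_id: str
    subject: str      # SIRE subject of the source
    chunk_id: str
    semantic: float   # e.g. vector similarity, assumed normalized to [0, 1]
    lexical: float    # e.g. full-text rank, assumed normalized to [0, 1]


def hybrid_score(c: Candidate, alpha: float = 0.5) -> float:
    """Weighted blend of semantic and lexical similarity (weights assumed)."""
    return alpha * c.semantic + (1 - alpha) * c.lexical


def group_and_rank(candidates: list, tenant_id: str, alpha: float = 0.5) -> list:
    """Return (subject, mean_score, chunk_ids) tuples, best subject first."""
    groups = defaultdict(list)
    for c in candidates:
        if c.tenant_id != tenant_id:
            continue  # stand-in for tenant-scoped row-level security
        groups[c.subject].append(c)

    ranked = []
    for subject, members in groups.items():
        members.sort(key=lambda m: hybrid_score(m, alpha), reverse=True)
        mean = fmean(hybrid_score(m, alpha) for m in members)
        ranked.append((subject, mean, [m.chunk_id for m in members]))
    ranked.sort(key=lambda g: g[1], reverse=True)
    return ranked
```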

Relationship Discovery

Kura's internal crosswalk subsystem processes the full evidence library to map coverage, overlaps, conflicts, and routing paths so multi-source governance can run deterministically.

  • Declared relationships: matches Excluded-to-Included across all sources, validates Relevant references, detects Subject overlap
  • Discovered relationships: cross-source embedding similarity (cosine threshold 0.80) finds coverage SIRE tags missed
  • Registry merge: confirmed, declared-only, discovered-only, and conflict relationships in one authority map with O(1) concept lookup
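
The discovery and merge steps above can be sketched like this. The 0.80 cosine threshold comes from the text; everything else is an assumption, including the use of one embedding per source (the real subsystem presumably compares chunks) and the function names. Conflict detection is left out because the source does not say how conflicts are identified.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def discover(embeddings: dict, threshold: float = 0.80) -> set:
    """Unordered source pairs whose embeddings meet the similarity threshold."""
    subjects = sorted(embeddings)
    pairs = set()
    for i, a in enumerate(subjects):
        for b in subjects[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                pairs.add(frozenset((a, b)))
    return pairs


def merge_registry(declared: set, discovered: set) -> dict:
    """Classify each pair; the dict gives constant-time lookup by pair."""
    registry = {}
    for pair in declared | discovered:
        if pair in declared and pair in discovered:
            registry[pair] = "confirmed"
        elif pair in declared:
            registry[pair] = "declared-only"
        else:
            registry[pair] = "discovered-only"
    return registry
```

Keying the registry by `frozenset` makes a relationship lookup independent of the order in which the two sources are named.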

What Kura Makes Inspectable

Every chunk carries an immutable attribution chain: embedding authority, egress policy, and pipeline run attestation. Un-attributed chunks are structurally impossible — enforced by database constraints, not application logic.

  • Chunk watermarks verify without database access (HMAC-SHA-256)
  • Immutable event log for every embedding operation
  • Sovereignty attribution on every vector, enforced by CHECK constraint
  • Designed for air-gap and VPC deployment

How to use Kura

POST /v2/documents with your source files. Kura parses, chunks, embeds, and tags them. Retrieval happens automatically when Kadai processes a governed request — or call the retrieval API directly. The same API works across all three tiers.

  • Ingest: POST /v2/documents with PDF, DOCX, JSON, Markdown, YAML, or CSV. Kura handles extraction, SIRE tagging, chunking, and embedding.
  • Retrieve: GET /v2/documents to list. Retrieval for governed responses is automatic through Kadai. Kura scopes by the caller boundary and source identity.
  • Workshop: Kura runs on shared Kenshiki Labs Aurora PostgreSQL with pgvector. Ingest via REST API from anywhere. Pre-loaded regulatory corpus available on day one.
  • Refinery: Kura runs inside your private deployment. Ingestion endpoints are internal to your VPC. Evidence stays inside your boundary.
  • Clean Room: Kura runs on local Aurora-compatible database inside the air gap. Documents ingested via secure media transfer — no network path.
  • Same ingestion API, same SIRE tagging, same retrieval interface, same ReBAC access control — the caller code does not change between tiers
  • Full API reference: /articles/governed-intelligence-api
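
A hedged sketch of preparing an ingest call with the Python standard library. Only the method and path (POST /v2/documents) and the accepted file types come from the text above. The host name, the bearer-token header, and sending the file as a raw request body are assumptions; consult the full API reference for the actual request format.

```python
import urllib.request
from pathlib import PurePosixPath

BASE_URL = "https://kura.example.invalid"  # hypothetical host, not a real endpoint

# File types listed in the ingest bullet, mapped to standard media types.
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".json": "application/json",
    ".md": "text/markdown",
    ".yaml": "application/yaml",
    ".csv": "text/csv",
}


def build_ingest_request(filename: str, payload: bytes, token: str,
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /v2/documents request.

    Assumed for illustration: raw-body upload and a bearer token.
    """
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix not in CONTENT_TYPES:
        raise ValueError(f"unsupported file type: {suffix or filename}")
    return urllib.request.Request(
        url=f"{base_url}/v2/documents",
        data=payload,
        headers={
            "Content-Type": CONTENT_TYPES[suffix],
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Since the caller code is meant to stay the same across tiers, only `base_url` would differ between a Workshop, Refinery, or Clean Room deployment in this sketch.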

Dependency on the Compiler

The metadata Kura stamps on every chunk (clause IDs, normative markers, SIRE tags, source tier) is what the Compiler uses to decide which CFPO zone each piece of evidence belongs in. Without this metadata, the Compiler cannot make informed zone decisions, and the prompt contract is ungoverned.

  • Normative mandates (SHALL/MUST) → Policy zone
  • Structural definitions and schemas → Format zone
  • Advisory narrative and context → Content zone
  • If a chunk arrives without SIRE tags or normative markers, it defaults to the Content zone with reduced authority weight
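
The zone rules above can be read as a small decision function. This is one interpretation, not the Compiler's code: the `ChunkMeta` fields, the `is_structural` flag, and the weights 1.0 and 0.5 are assumptions, and `None` is used to mean "enrichment never stamped this field", as distinct from an empty set of normative flags.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Zone(Enum):
    POLICY = "policy"
    FORMAT = "format"
    CONTENT = "content"


@dataclass(frozen=True)
class ChunkMeta:
    sire_subject: Optional[str]             # None: SIRE tags missing
    normative_flags: Optional[frozenset]    # None: markers never stamped
    is_structural: bool = False             # hypothetical flag for schemas/definitions


FULL_WEIGHT = 1.0     # assumed value
REDUCED_WEIGHT = 0.5  # assumed value


def assign_zone(meta: ChunkMeta) -> tuple:
    """Return (zone, authority_weight) for a chunk under the rules listed above."""
    if meta.sire_subject is None or meta.normative_flags is None:
        # Missing metadata: default to Content with reduced authority.
        return (Zone.CONTENT, REDUCED_WEIGHT)
    if meta.normative_flags:
        return (Zone.POLICY, FULL_WEIGHT)   # SHALL/MUST mandates
    if meta.is_structural:
        return (Zone.FORMAT, FULL_WEIGHT)   # definitions and schemas
    return (Zone.CONTENT, FULL_WEIGHT)      # advisory narrative
```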

Who this is for

Corpus engineers

Data stewards who curate, version, and maintain authoritative source collections inside the evidence boundary. They are responsible for ingestion, SIRE tagging, and evidence quality.

Every downstream system

Compiler draws evidence for zone mapping. Ledger checks claims against it. Gate relies on it for emission policy. Kadai returns answers bounded by what Kura contains.

Kura, the governed evidence store, is the evidence boundary. SIRE (Subject, Included, Relevant, Excluded) identity tags scope what each source covers and what it must never answer. Every chunk carries provenance chains, SHA-256 hashes, and HMAC-SHA-256 watermarks. Compiler (the prompt-assembly engine), Ledger (the integrity-protected inference audit trail), and Gate (the emission policy boundary) all depend on Kura as the source of governed evidence.