Kenshiki

Evidence store

Kura Index

Store what counts as real.

You POST source material into Kura. The system preserves provenance, structure, and retrieval boundaries so every downstream answer traces back to something real. Documents are parsed by Docling, enriched with clause IDs, normative markers, and SIRE identity tags, then chunked and embedded into the governed evidence store. The Crosswalk scopes retrieval by caller identity and source boundaries.

Without Kura, every downstream decision is an assertion without evidence. The Compiler cannot scope what the model sees. The Ledger has nothing to check claims against. The Gate has no basis. No evidence, no grounded answer.

Why Kura Exists

Standard RAG retrieves whatever is nearest in embedding space and hands it to the model. No authority boundary, no provenance, no access control, no way to prove what the model was allowed to see. Governed inference requires a governed evidence boundary.

  • RAG without authority boundaries is retrieval, not governance
  • The model must not see evidence the caller cannot access
  • Post-generation scoring cannot fix what was never in scope
  • Every claim in the Ledger traces back to a specific chunk with provenance

What Kura Does

Kura transforms source documents into a queryable, tamper-evident knowledge base. Every chunk carries provenance from upload through embedding.

  • SHA-256 source hash, idempotent upsert, version-aware change detection
  • Section-aware chunking on heading boundaries with merge for undersized chunks
  • HMAC-SHA-256 watermarks per chunk — verification without database access
  • Embedding via text-embedding-3-large (512d Matryoshka)
  • Tenant provenance on every row, enforced by CHECK constraints
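The per-chunk watermark can be sketched as a keyed HMAC over the chunk's identity and content. This is an illustrative sketch, not Kenshiki's implementation: the payload layout, key management, and chunk ID scheme here are assumptions; only the primitive (HMAC-SHA-256, verifiable without touching the database) comes from the text above.

```python
import hashlib
import hmac

def watermark_chunk(secret_key: bytes, chunk_id: str, chunk_text: str) -> str:
    """Derive an HMAC-SHA-256 watermark over the chunk identity and content.

    Anyone holding the key can recompute and compare the tag without a
    database round trip: the chunk carries its own proof.
    """
    payload = f"{chunk_id}\n{chunk_text}".encode("utf-8")
    return hmac.new(secret_key, payload, hashlib.sha256).hexdigest()

def verify_chunk(secret_key: bytes, chunk_id: str, chunk_text: str, tag: str) -> bool:
    """Constant-time comparison against the stored watermark."""
    expected = watermark_chunk(secret_key, chunk_id, chunk_text)
    return hmac.compare_digest(expected, tag)
```

Any edit to the chunk text or its ID invalidates the tag, which is what makes the store tamper-evident rather than merely access-controlled.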

Pre-loaded Regulatory Corpus

Every Kenshiki environment ships with a governed evidence base covering major AI governance standards, compliance frameworks, and industry-specific regulatory guidance — 2,200+ chunks, pre-tagged with SIRE identity and crosswalk mappings. Each framework is mapped through the Ontic Compliance Catalog to enforceable obligations, so the SIRE gate knows which evidence must exist before a governed request can proceed. Governed inference works on day one. Add your own documents on top.

  • EU AI Act (Regulation 2024/1689)
  • EU GDPR
  • HIPAA Administrative Simplification
  • PCI DSS 4.0.1
  • ISO/IEC 27001:2022 — Information Security
  • ISO/IEC 42001:2023 — AI Management System
  • ISO/IEC 23894:2023 — AI Risk Management
  • NIST AI Risk Management Framework 1.0
  • NIST AI 600-1 — Generative AI Profile
  • NIST Cybersecurity Framework 2.0
  • AICPA Trust Services Criteria (SOC 2)
  • DOJ Evaluation of Corporate Compliance Programs
  • 28 industry verticals: Financial Services, Healthcare, Defense & Intelligence, Government, Legal, Energy, Life Sciences, Education, and more
  • Ontic Compliance Catalog maps each framework to abstract obligations — the SIRE gate enforces them before retrieval

How Documents Are Parsed

Docling runs GPU-accelerated layout analysis (DocLayNet), table extraction (TableFormer), and OCR (EasyOCR). Output is enriched before chunking — not after, and not by the model.

  • Two-stage pipeline: GPU parse → CPU enrichment
  • Clause ID extraction for regulatory citations (e.g., "DFARS 252.204-7012")
  • Normative language detection (SHALL/MUST/REQUIRED flags)
  • Cross-reference resolution between sections and documents
  • Quality gate rejects OCR garbage, TOC entries, and low-density chunks
  • SIRE identity tags stamped during enrichment
  • Failed parses after 3 retries: quarantined, not dropped. Ledger receives a DEGRADED_BOUNDARY annotation.
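The enrichment stage can be approximated with pattern matching. The regexes below are hypothetical and cover only the DFARS-style citation from the example above plus the three normative flags named in the list; the real pipeline handles many citation grammars.

```python
import re

# Hypothetical patterns: real enrichment covers many citation formats.
CLAUSE_ID = re.compile(r"\b(?:DFARS|FAR)\s\d{3}\.\d{3}-\d{4}\b")
NORMATIVE = re.compile(r"\b(?:SHALL(?:\s+NOT)?|MUST(?:\s+NOT)?|REQUIRED)\b")

def enrich(text: str) -> dict:
    """Stamp clause IDs and a normative-language flag onto a parsed chunk."""
    return {
        "clause_ids": CLAUSE_ID.findall(text),
        "normative": bool(NORMATIVE.search(text)),
    }
```

The point of doing this before chunking, and outside the model, is that the flags are deterministic: the same source text always produces the same metadata.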

SIRE Identity System

SIRE (Subject, Included, Relevant, Excluded) is deterministic identity metadata embedded in source frontmatter during ingestion. It defines what each source covers, relates to, and must never answer. Only Excluded enforces — the other three inform discovery.

  • Subject: anchors the source to a domain (e.g., soc_2_trust_services_criteria, eu_ai_act)
  • Included: enriches search with covered terminology (e.g., 'conformity assessment', 'cardholder data')
  • Relevant: maps cross-source topology (e.g., ISO 27001 → SOC 2; NIST AI RMF → EU AI Act)
  • Excluded: hard boundary (e.g., SOC 2 excludes 'sox', 'gaap', 'hipaa')
  • Exclusion gate purges matching chunks at retrieval — case-insensitive, word-boundary match
  • SIRE proposals generated by keyword frequency scan, then manually curated before application
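The exclusion gate is the one SIRE field that enforces, and its matching rule (case-insensitive, word-boundary) is simple enough to sketch. The chunk shape here is an assumption; the purge semantics follow the list above.

```python
import re
from typing import Iterable

def exclusion_gate(chunks: list[dict], excluded: Iterable[str]) -> list[dict]:
    """Purge retrieved chunks that hit an Excluded term.

    Matching is case-insensitive on word boundaries, so an Excluded
    term of 'sox' removes 'SOX reporting' regardless of case but never
    fires on a longer word that merely contains the letters.
    """
    terms = list(excluded)
    if not terms:
        return chunks
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [c for c in chunks if not pattern.search(c["text"])]
```

Because the gate runs at retrieval, an excluded chunk never reaches the model, which is stronger than scoring it down afterward.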

Retrieval and Access Control

At retrieval time, the Crosswalk scopes evidence by the caller's access boundary via OpenFGA/ReBAC. The model only sees what the caller is authorized to use for this specific question.

  • Hybrid retrieval (pgvector + tsvector) ranked by semantic + lexical similarity
  • Chunks grouped by SIRE subject, ranked by mean relevance
  • SIRE exclusion gate purges out-of-scope chunks before they reach the model
  • Per-caller evidence scoping via OpenFGA relationship-based access control
  • Tenant-scoped row-level security on every evidence table
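The hybrid ranking step can be sketched as a weighted blend of the two index scores. In Kenshiki this runs in SQL against pgvector and tsvector; the Python below is a stand-in, and the 0.7/0.3 weighting is an illustrative assumption, not the product's actual tuning.

```python
def hybrid_rank(chunks: list[dict], w_semantic: float = 0.7) -> list[dict]:
    """Rank chunks by a weighted blend of semantic and lexical similarity.

    Assumes each chunk already carries normalized scores from the two
    indexes (pgvector cosine similarity, tsvector ts_rank).
    """
    w_lexical = 1.0 - w_semantic
    return sorted(
        chunks,
        key=lambda c: w_semantic * c["semantic"] + w_lexical * c["lexical"],
        reverse=True,
    )
```

A chunk that is strong on only one axis can still surface, but a chunk that scores well on both will dominate the ranking.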

Relationship Discovery

The Crosswalk processes the full evidence library to map coverage, overlaps, conflicts, and routing paths so multi-source governance can run deterministically.

  • Declared relationships: matches Excluded-to-Included across all sources, validates Relevant references, detects Subject overlap
  • Discovered relationships: cross-source embedding similarity (cosine threshold 0.80) finds coverage SIRE tags missed
  • Registry merge: confirmed, declared-only, discovered-only, and conflict relationships in one authority map with O(1) concept lookup
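The discovered-relationship pass can be sketched as a pairwise cosine comparison over per-source embedding centroids at the 0.80 threshold named above. Representing each source by a single centroid vector is an assumption made for brevity.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def discover_links(centroids: dict[str, list[float]], threshold: float = 0.80):
    """Propose a cross-source relationship wherever two source centroids
    exceed the cosine threshold, even if no SIRE tag declared the link."""
    sources = sorted(centroids)
    return [
        (a, b)
        for i, a in enumerate(sources)
        for b in sources[i + 1:]
        if cosine(centroids[a], centroids[b]) >= threshold
    ]
```

Discovered pairs then feed the registry merge, where they are reconciled against declared SIRE relationships.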

What Kura Proves

Every chunk carries an immutable attribution chain: embedding authority, egress policy, and pipeline run attestation. Unattributed chunks are structurally impossible — enforced by database constraints, not application logic.

  • Chunk watermarks verify without database access (HMAC-SHA-256)
  • Immutable event log for every embedding operation
  • Sovereignty attribution on every vector, enforced by CHECK constraint
  • Designed for air-gap and VPC deployment

Tier Variations

Kura runs in every deployment tier. What changes is where the evidence store lives and who controls it.

  • Workshop: Kura runs in the shared Kenshiki environment. Evidence is managed by Kenshiki.
  • Refinery: Kura runs inside the customer's private deployment (VPC, GovCloud, or connected on-prem). Evidence stays inside the customer's boundary.
  • Clean Room: Kura runs on air-gapped, verified hardware. Evidence never leaves the customer's physical premises.

Dependency on the Prompt Compiler

The metadata Kura stamps on every chunk — clause IDs, normative markers, SIRE tags, source tier — is what the Prompt Compiler uses to decide which CFPO zone each piece of evidence belongs in. Without this metadata, the Compiler cannot make informed zone decisions, and the prompt contract is ungoverned.

  • Normative mandates (SHALL/MUST) → Policy zone
  • Structural definitions and schemas → Format zone
  • Advisory narrative and context → Content zone
  • If a chunk arrives without SIRE tags or normative markers, it defaults to the Content zone with reduced authority weight
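The zone-routing rules above can be sketched as a single dispatch over chunk metadata. The field names and the 0.5 reduced-authority weight are illustrative placeholders, not product constants; only the zone assignments come from the list.

```python
def route_zone(chunk: dict) -> tuple[str, float]:
    """Map a Kura chunk to a CFPO zone from its enrichment metadata."""
    if not chunk.get("sire_tags") and not chunk.get("normative"):
        return ("content", 0.5)   # ungoverned metadata: reduced authority
    if chunk.get("normative"):
        return ("policy", 1.0)    # SHALL/MUST mandates
    if chunk.get("is_schema"):
        return ("format", 1.0)    # structural definitions and schemas
    return ("content", 1.0)       # advisory narrative and context
```

The fallback branch is what makes the contract degrade safely: missing metadata lowers authority instead of silently promoting a chunk into the Policy zone.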

Who this is for

Corpus engineers

data stewards who curate, version, and maintain authoritative source collections inside the evidence boundary. Responsible for ingestion, SIRE tagging, and evidence quality.

Every downstream system

the Prompt Compiler draws evidence from it for zone mapping. The Claim Ledger checks claims against it. The Boundary Gate relies on it for emission decisions. Kadai returns answers bounded by what Kura contains.