The full pipeline, the masking stack, the model orchestration, the audit chain — component by component, with the stack named. Written for people who read pipelines, not promises. Every present-tense claim on this page is true in the codebase today and verified against our synthetic canary corpus.
ROUTING DOCTRINE — Local by default. Anything going to cloud goes masked
by default. Raw cloud only by explicit, logged override.
ACCESS DOCTRINE — Default-deny. No rule row, no access. Folder policies can
tighten a file's routing, never loosen it.
FAILURE DOCTRINE — Over-masking beats under-masking. Uncertainty escalates;
it never guesses. No file exits the pipeline unclassified.
Every file in your archive passes through the same staged pipeline. Each stage writes its decision to governance.db — a SQLite database that is the single source of truth for what every file is, how sensitive it is, and what any model is ever allowed to do with it.
Files are identified by what they actually are, not what the extension claims. Three detection layers in priority order: MIME inspection (definitive), filename patterns (high confidence), content patterns (inferred). ZIP archives are extracted in place and every member re-enters the full pipeline. Encrypted archives are detected and flagged rather than skipped silently.
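The priority order above can be sketched in a few lines. This is a minimal illustration, not the shipped detector: the signature tables, the `Detection` type and the `detect_type` function are all hypothetical, and plain magic-byte sniffing stands in for real MIME inspection.

```python
import re
from typing import NamedTuple, Optional


class Detection(NamedTuple):
    kind: str
    layer: str        # which layer made the call
    confidence: str   # "definitive" | "high" | "inferred"


# Hypothetical signature tables, for illustration only.
MAGIC = {b"%PDF-": "pdf", b"PK\x03\x04": "zip", b"\x89PNG\r\n\x1a\n": "png"}
NAME_PATTERNS = [(re.compile(r"payslip", re.I), "payslip"),
                 (re.compile(r"\.csv$", re.I), "csv")]
CONTENT_PATTERNS = [(re.compile(rb"National Insurance", re.I), "payroll-text")]


def detect_type(filename: str, head: bytes) -> Optional[Detection]:
    """Try each layer in priority order; the first hit wins."""
    # Layer 1: what the bytes say (stand-in for MIME inspection).
    for magic, kind in MAGIC.items():
        if head.startswith(magic):
            return Detection(kind, "mime", "definitive")
    # Layer 2: filename patterns.
    for pattern, kind in NAME_PATTERNS:
        if pattern.search(filename):
            return Detection(kind, "filename", "high")
    # Layer 3: content patterns.
    for pattern, kind in CONTENT_PATTERNS:
        if pattern.search(head):
            return Detection(kind, "content", "inferred")
    # No layer matched: the caller should escalate, not guess.
    return None
```

Note that a PDF renamed to `holiday.jpg` is still reported as a PDF, because the byte signature is checked before the name is ever consulted.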
Exact duplicates by SHA-256. Near-duplicates by MinHash signatures with locality-sensitive hashing at a 0.90 similarity threshold — five copies of the same policy document across three old drives collapse to one canonical file, with every duplicate path recorded against it.
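As a rough sketch of the two checks, assuming word 3-shingles and 128 hash functions (both are illustrative choices, not figures from the codebase): exact duplicates compare SHA-256 digests, and near-duplicates compare MinHash signatures against the 0.90 threshold. The function names are hypothetical, and the sketch compares signatures pairwise; the LSH banding that avoids comparing every pair is omitted.

```python
import hashlib
from typing import List, Set


def sha256_hex(data: bytes) -> str:
    """Exact-duplicate key: identical bytes give identical digests."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 3) -> Set[str]:
    """Overlapping k-word windows; k=3 is an assumption for this sketch."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(items: Set[str], num_perm: int = 128) -> List[int]:
    """One minimum per salted hash function approximates a random permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for s in items))
    return signature


def estimate_similarity(a: List[int], b: List[int]) -> float:
    """Fraction of matching positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.90) -> bool:
    sig_a = minhash(shingles(text_a))
    sig_b = minhash(shingles(text_b))
    return estimate_similarity(sig_a, sig_b) >= threshold
```

Two copies of a policy document that differ by one corrected word share almost all of their shingles, so they land above the threshold and collapse to one canonical file, while their SHA-256 digests differ.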
Every file is typed against 44 document types — payslip, engagement letter, board minutes, court bundle, DBS register, care plan, NDA, HMRC correspondence and 36 more — and tagged with sensitivity labels (criminal, health, legal, financial, credentials, third-party). Classification runs on local models with a confidence loop: uncertain answers escalate to a bigger model before anything is accepted (§3).
Labels drive routing. credentials and criminal are force-blocked: no cloud, no proxy, and credentials block even local model reads by default. legal, health, financial, third-party, PII are private-but-maskable: local by default, cloud only via the masking path. Folder policies apply most-restrictive-wins — a folder rule can tighten a file's routing but never loosen it. Untagged files containing PII get a conservative fallback rather than a guess. A safety net forces masking for NHS numbers, National Insurance numbers and UTRs regardless of folder policy.
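One way to picture most-restrictive-wins is an ordered scale where every rule can only push a file upward. The sketch below assumes a four-step scale and a label table that mirrors the paragraph above; the `Route` names, the `route_for` function and the treatment of unlabelled files as cloud-clean are illustrative assumptions, not the real schema.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Route(IntEnum):
    """Hypothetical routing scale: higher means more restrictive."""
    CLOUD_CLEAN = 0
    MASKED_CLOUD = 1   # local by default, cloud only via the masking path
    LOCAL_ONLY = 2     # no cloud, no proxy
    BLOCKED = 3        # no model reads at all, local included


LABEL_ROUTE = {
    "credentials": Route.BLOCKED,
    "criminal": Route.LOCAL_ONLY,
    "legal": Route.MASKED_CLOUD,
    "health": Route.MASKED_CLOUD,
    "financial": Route.MASKED_CLOUD,
    "third-party": Route.MASKED_CLOUD,
    "PII": Route.MASKED_CLOUD,
}


def route_for(labels: Iterable[str],
              folder_route: Optional[Route] = None,
              has_hard_identifier: bool = False) -> Route:
    # An unknown label is treated as blocked: default-deny.
    route = max((LABEL_ROUTE.get(label, Route.BLOCKED) for label in labels),
                default=Route.CLOUD_CLEAN)
    # A folder policy can tighten the route but never loosen it.
    if folder_route is not None:
        route = max(route, folder_route)
    # Safety net: NHS number, NINO or UTR present forces at least masking.
    if has_hard_identifier:
        route = max(route, Route.MASKED_CLOUD)
    return route
```

Because every step is a `max`, the order in which rules are applied does not matter, and no later rule can undo an earlier restriction.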
Photos are captioned and CLIP-classified; images containing people are routed to owner review and never sent to cloud automatically. Audio gets voice-activity detection, speaker diarisation and event detection. Long video is segmented before analysis. Spreadsheets, calendar files and email exports all have dedicated extraction paths. A blank CLIP result escalates — no file exits unclassified.
Cleared content is chunked and embedded into a local vector store. The index is built from the promoted knowledge tree only — never from raw staging. Masked-tier files are indexed from their masked artefacts, not their originals. Retrieval is scoped per department: each assistant queries only its own folder partition, and every answer carries citations to the source documents it came from.
No single NER engine is good enough, so we don't use one. Masking is six layers deep — deterministic structure first, statistical models behind it, a semantic model behind that, and a scanner that audits the result. Each layer exists because we watched the previous one miss something.
The components above are open-source and anyone can download them. The choreography is the product. AIS combines models the way you'd staff a careful office: cheap fast judgement first, a more thoughtful reviewer behind it, arguments when they disagree, and a human when the machines aren't sure. Nothing guesses.
Classification starts on a small fast local model. If confidence is below threshold, a larger local reasoning model takes over and the loop repeats — up to three attempts. Still uncertain? The file escalates out of the machine entirely, into the human approval queue. Acceptance requires confidence, not exhaustion.
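The loop can be sketched as follows. The 0.8 threshold, the reading of "up to three attempts" as three attempts in total, and every name here (`Verdict`, `Outcome`, `classify_with_escalation`) are assumptions for illustration; the classifiers are passed in as plain callables so the sketch runs without any model.

```python
from typing import Callable, NamedTuple, Optional


class Verdict(NamedTuple):
    label: str
    confidence: float


class Outcome(NamedTuple):
    status: str            # "accepted" or "human_review"
    label: Optional[str]
    attempts: int


Classifier = Callable[[str], Verdict]


def classify_with_escalation(text: str,
                             small: Classifier,
                             large: Classifier,
                             threshold: float = 0.8,
                             max_attempts: int = 3) -> Outcome:
    """First attempt on the small model, remaining attempts on the large one."""
    models = [small] + [large] * (max_attempts - 1)
    for attempt, model in enumerate(models, start=1):
        verdict = model(text)
        if verdict.confidence >= threshold:
            return Outcome("accepted", verdict.label, attempt)
    # Running out of attempts never accepts a label: it queues for a human.
    return Outcome("human_review", None, max_attempts)
```

The last line is the point of the design: exhausting the attempts returns no label at all, so a low-confidence answer cannot slip through by being the final one.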
Our classifier validation harness runs two models in parallel on the same evidence. When they disagree, a third reasoning model sees both answers and both reasonings, and rules — with an explicit bias: when in doubt, rule more restrictive, because in governance a false negative is worse than a false positive. If the arbitrator is ambiguous, the file is marked owner-decides. It does not guess.
R&D harness — how we validate classifier behaviour before it ships.
Masking doesn't run once and hope. It runs, the visible-risk scanner audits the output, and the loop repeats until the scan is clean. The same pattern gates the Tier-2 cloud path: a post-distillation check aborts the entire send if a single forbidden entity survives.
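A minimal sketch of the mask-then-audit loop, assuming a toy scanner that only knows a simplified National Insurance number shape. The round cap and the `MaskingFailed` exception are assumptions added so the sketch fails closed rather than looping forever; all names are hypothetical.

```python
import re
from typing import Callable, List


class MaskingFailed(Exception):
    """Raised when the scan is still dirty after the allowed rounds."""


# Simplified NI number shape, for illustration only.
NINO = re.compile(r"\b[A-Z]{2}\d{6}[A-D]\b")


def scan_visible_risks(text: str) -> List[str]:
    return NINO.findall(text)


def mask_findings(text: str, findings: List[str]) -> str:
    for finding in findings:
        text = text.replace(finding, "[NINO]")
    return text


def mask_until_clean(text: str,
                     mask: Callable[[str, List[str]], str] = mask_findings,
                     scan: Callable[[str], List[str]] = scan_visible_risks,
                     max_rounds: int = 5) -> str:
    for _ in range(max_rounds):
        findings = scan(text)
        if not findings:
            return text            # the scanner, not the masker, declares success
        text = mask(text, findings)
    if scan(text):
        raise MaskingFailed("visible risks survived masking; refusing to release")
    return text
```

The masker never gets to mark its own work: output is released only when an independent scan returns nothing.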
For judgement questions, the portal convenes a five-perspective sequential panel: each panellist model answers in turn, sees what the previous panellists said, agrees or pushes back, then a synthesis pass distils the debate. Sequential on purpose — a panel that can't hear itself is just five guesses.
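The sequential shape is simple to sketch: each panellist receives the transcript so far, and the synthesis pass sees all of it. The `run_panel` function and its callable signatures are hypothetical, and panellists are plain functions here so the sketch runs without a model.

```python
from typing import Callable, List, Sequence, Tuple

Turn = Tuple[str, str]                           # (panellist name, answer)
Panellist = Callable[[str, Sequence[Turn]], str]
Synthesiser = Callable[[str, Sequence[Turn]], str]


def run_panel(question: str,
              panel: Sequence[Tuple[str, Panellist]],
              synthesise: Synthesiser) -> Tuple[str, List[Turn]]:
    """Ask each panellist in turn, showing them everything said so far."""
    transcript: List[Turn] = []
    for name, ask in panel:
        # A tuple copy keeps a panellist from editing earlier answers.
        answer = ask(question, tuple(transcript))
        transcript.append((name, answer))
    return synthesise(question, transcript), transcript
```

Running the panellists in parallel would be faster, but then no panellist could agree with or push back on another, which is the behaviour the sequential order exists to get.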
Human review isn't a dead end — it's training signal. Every Confirm or Block in the review UI upgrades the entity register and rewrites routing rules in governance.db. The system's masking and routing measurably improve with every decision the owner makes.
Uncertain files queue for human sign-off before any cloud send. People photos always stop at owner review. Review-tier files require an explicit approval. The human gate is a policy decision, not a quality check — the engine handles quality; the owner decides exposure.
When a job genuinely needs a frontier model, the content doesn't go raw. It goes through a pipeline designed so that your identifiers never touch the cloud model and the entity map never leaves the box.
Entities become stable placeholders ([PERSON_1], [ORG_2]) so the cloud model can still reason about the relationships. A local model distils the content; the leakage gate then re-scans the distillation and aborts the entire send if any forbidden entity survived. Only then does anything cross the wire. The placeholder map stays local, and the response is rehydrated on your hardware. Every step is one audit row.
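The placeholder round trip can be sketched under simple assumptions: entities arrive as an already-detected mapping from surface text to type, replacement is plain string substitution, and the leakage gate is a case-insensitive substring check. The function names and `LeakageError` are hypothetical, and the local distillation step is left out.

```python
from typing import Dict, Tuple


class LeakageError(Exception):
    """Raised when an original identifier is found in outbound text."""


def pseudonymise(text: str, entities: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """entities maps surface text to type, e.g. {"Jane Smith": "PERSON"}."""
    counters: Dict[str, int] = {}
    mapping: Dict[str, str] = {}          # placeholder -> original, stays local
    # Longest first, so a short name never clobbers part of a longer one.
    for surface, etype in sorted(entities.items(), key=lambda kv: -len(kv[0])):
        counters[etype] = counters.get(etype, 0) + 1
        placeholder = f"[{etype}_{counters[etype]}]"
        mapping[placeholder] = surface
        text = text.replace(surface, placeholder)
    return text, mapping


def leakage_gate(outbound: str, mapping: Dict[str, str]) -> None:
    """Abort the whole send if any original identifier survived."""
    lowered = outbound.lower()
    survivors = [s for s in mapping.values() if s.lower() in lowered]
    if survivors:
        raise LeakageError(f"{len(survivors)} forbidden entities in outbound text")


def rehydrate(response: str, mapping: Dict[str, str]) -> str:
    for placeholder, surface in mapping.items():
        response = response.replace(placeholder, surface)
    return response
```

Because the same entity always maps to the same placeholder, the cloud model can still tell that the person who raised an invoice is the person who was paid, without ever seeing a name.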
Verified: 0 PII leaks across 13 audited canary runs
Routing decisions are worthless if the chat box ignores them, so it can't. Every prompt assembled for a cloud model passes a default-deny eligibility gate at the retrieval layer, not the UI layer.
Context reaches a cloud-bound prompt only if the source artefact is in the cloud-clean tier or is a masked artefact (masked by construction at promotion — the raw original never enters the knowledge tree for masked-tier files). Files with outstanding visible risks are refused regardless of tier. Department assistants (KIDs) are bound to local models — selecting a cloud model for a KID is overridden, visibly. If nothing in scope is cloud-eligible, the chat falls back to a local model with full context rather than sending a cloud model nothing — and tells you it did. Every withholding and override is surfaced in the response itself.
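A sketch of the eligibility rule and the local fallback, assuming each candidate artefact carries its tier, a masked-artefact flag and a count of outstanding visible risks. The `Artefact` fields, function names and notice strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Artefact:
    path: str
    tier: str                     # e.g. "cloud-clean", "masked", "local-only"
    masked_artefact: bool = False
    visible_risks: int = 0


def cloud_eligible(a: Artefact) -> bool:
    """Default-deny: eligible only if a rule explicitly says so."""
    if a.visible_risks > 0:
        return False              # outstanding risks refuse regardless of tier
    return a.tier == "cloud-clean" or a.masked_artefact


def build_prompt_context(candidates: Sequence[Artefact],
                         requested: str = "cloud"
                         ) -> Tuple[str, List[Artefact], List[str]]:
    """Return (model target, context to use, notices to show the user)."""
    if requested != "cloud":
        return "local", list(candidates), []
    eligible = [a for a in candidates if cloud_eligible(a)]
    if not eligible:
        # Nothing may leave: answer locally with full context, and say so.
        return "local", list(candidates), [
            "fell back to local: nothing in scope is cloud-eligible"]
    notices = [f"withheld: {a.path}" for a in candidates if not cloud_eligible(a)]
    return "cloud", eligible, notices
```

The filter runs where context is assembled, so a different front end calling the same retrieval layer gets the same refusals.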
Every read, classification, masking application and routing decision is one append-only row. No updates, no deletes.
Each row carries the source content hash, the output content hash, the policy version that made the decision, and a correlation ID that chains the events of one operation together — so a masked cloud send is provably read → classify → mask_apply → llm_send_cloud_proxy with the mask output hash matching the proxy input hash. A director can answer "what has any AI ever seen of this file?" with a query, not a meeting.
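As an illustration of how such a chain could be enforced and queried in SQLite, the sketch below uses triggers to reject updates and deletes, and a query that checks the event order and the hash hand-off for one correlation ID. The table layout, column names and helper functions are assumptions, not the real governance.db schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE audit (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    correlation_id TEXT NOT NULL,
    event          TEXT NOT NULL,
    source_hash    TEXT NOT NULL,
    output_hash    TEXT NOT NULL,
    policy_version TEXT NOT NULL
);
CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit
BEGIN SELECT RAISE(ABORT, 'audit is append-only'); END;
CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit
BEGIN SELECT RAISE(ABORT, 'audit is append-only'); END;
"""

EXPECTED_CHAIN = ["read", "classify", "mask_apply", "llm_send_cloud_proxy"]


def open_audit(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def append(conn: sqlite3.Connection, correlation_id: str, event: str,
           source_hash: str, output_hash: str,
           policy_version: str = "policy-v1") -> None:
    conn.execute(
        "INSERT INTO audit (correlation_id, event, source_hash, output_hash,"
        " policy_version) VALUES (?, ?, ?, ?, ?)",
        (correlation_id, event, source_hash, output_hash, policy_version))


def verify_masked_send(conn: sqlite3.Connection, correlation_id: str) -> bool:
    """True if the chain is complete and the mask output fed the proxy."""
    rows = conn.execute(
        "SELECT event, source_hash, output_hash FROM audit"
        " WHERE correlation_id = ? ORDER BY id", (correlation_id,)).fetchall()
    if [row[0] for row in rows] != EXPECTED_CHAIN:
        return False
    mask_output_hash = rows[2][2]
    proxy_input_hash = rows[3][1]
    return mask_output_hash == proxy_input_hash
```

With this layout, "what has any AI ever seen of this file?" becomes a `SELECT` filtered on `source_hash`.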
Live: 3,472 audit entries across the five-business canary corpus
You can't demo a privacy product on a customer's private data, and a vendor who demos on lorem ipsum has tested nothing. So we built five complete synthetic businesses (an accountancy practice, a care group, a recruitment firm and two more) with payrolls, invoices, board minutes, HMRC correspondence, legal matters and staff photos, graded across difficulty tiers from "must block" to "publish freely", seeded with deliberate traps.
Every engine change runs against the canary corpus before it ships. The masking recall suite asserts entity-by-entity on real document structures — CSV payrolls, email threads, scanned letters — not toy strings. Company names in the corpus are checked against the live Companies House register so no synthetic entity collides with a real business. The same corpus powers the demo: what you're shown is the verification, not a performance.
No mystery boxes. The moat isn't the parts list — it's the UK recognisers, the routing doctrine, the orchestration and the canary methodology.
| Component | Role | Why this one |
|---|---|---|
| Presidio | Pattern NER | Microsoft's PII engine — extended with custom UK recognisers (NINO, UTR, sort codes, postcodes) and context gating |
| GLiNER | Semantic NER | Zero-shot entity detection — catches rare names and building names pattern engines miss |
| spaCy | Statistical NER | en_core_web_md behind Presidio for person/location coverage |
| phi4 | Fast local judgement | First-pass classification and Tier-2 distillation — cheap, quick, good enough until it isn't |
| gemma (4/3-class) | Local reasoner | Escalation target when phi4 is uncertain; benchmarked against alternatives on our own corpus |
| deepseek-r1 | Arbitrator | Rules on classifier disagreements in the validation harness — restrictive bias |
| Claude | Cloud tier | The only cloud model wired in — reached exclusively through the masked Tier-2 path or the chat gate. Routed on demand only |
| Ollama | Local model runtime | Every local model runs on your hardware via Ollama — no cloud dependency for core work |
| ChromaDB | Vector store | Local persistent retrieval index, built from the promoted knowledge tree only; masked-tier files are indexed from their masked artefacts |
| nomic-embed-text | Embeddings | Local embedding model — your documents are never embedded by a cloud API |
| CLIP + PANNs + Resemblyzer | Media analysis | Image classification, audio event detection, speaker diarisation — all local |
| MinHash/LSH | Near-dedup | Sub-quadratic similarity at archive scale, 0.90 threshold |
| SQLite | governance.db + audit | Single-file, inspectable, no server to misconfigure — the audit chain is a file you can copy and query |
| Apache Tika | Text extraction | Content-first type detection across the long tail of office formats |
The fastest way to lose a technical reader is to pretend. These are designed, not shipped:
When one of these ships, it moves up the page — with its verification numbers. That's the deal.