Your organization knows more
than any database can hold

The knowledge that matters most — incident patterns, tribal expertise, research hunches, the thing your ops team figured out last Tuesday — lives in conversations and documents that vanish. LLMs can finally read all of it. Substrate is where they put what they find.

$ pip install substrate-db

The Knowledge That Never Gets Captured

Your best engineer knows the payment service fails at 180 connections, not 200 like the docs say. A researcher noticed a link between two pathways nobody's published yet. Your ops team figured out that Tuesday deploys cause more incidents — it's buried in a Slack thread from March.

This is the most valuable knowledge your organization has. None of it is in any database. It can't be — databases need clean, single-version-of-truth data. Real organizational knowledge is contradictory, multi-source, and changes every week. It breaks INSERT statements and ON CONFLICT clauses.

LLMs change this. They can read Slack exports, scan incident reports, process research papers. They can finally access this knowledge. But they need somewhere to put it that handles contradictions natively and tracks who said what.

The Same Fact, Three Systems

"The East Palestine water is contaminated." Here's how three different databases store that:

Relational DB:
INSERT INTO events (name, status) VALUES ('EP Derailment', 'contaminated');
Who said contaminated? The EPA? A Reddit post? The company? No idea.

Knowledge Graph:
(Derailment)-[:CAUSED]->(Contamination)
An edge exists. It says nothing about who established it or how certain we are.

Substrate:
AP News citing Ohio DNR says
Derailment caused Contamination
confidence: 0.90
+ residents corroborate (0.55)
An agent extracted this — you can see exactly what it was looking at, how confident it was, and who else agrees.

When an agent extracts 500 facts from your Slack channels overnight, you need every write to carry its source. Not as metadata you hope someone fills in — as a hard requirement the engine enforces. If two sources contradict each other, both claims coexist. When one is discredited, you retract it and the other survives.
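The mechanics can be pictured with a small stdlib sketch. This is a toy model of the idea (source-required writes, coexisting contradictions, retraction by source), not the real substrate-db API; the class and method names here are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    statement: str
    source: str       # provenance travels with every claim
    confidence: float

class ClaimStore:
    """Toy provenance-first store. Contradictory claims coexist;
    retracting a source removes only that source's claims."""

    def __init__(self):
        self.claims = []

    def ingest(self, statement, source, confidence):
        if not source:
            # provenance is a hard requirement, not optional metadata
            raise ValueError("every claim must carry a source")
        self.claims.append(Claim(statement, source, confidence))

    def retract_source(self, source):
        # independently sourced claims survive the retraction
        self.claims = [c for c in self.claims if c.source != source]

    def support(self, statement):
        return [c for c in self.claims if c.statement == statement]

db = ClaimStore()
db.ingest("derailment caused contamination", "AP News / Ohio DNR", 0.90)
db.ingest("derailment caused contamination", "residents on X", 0.55)
db.ingest("situation contained", "Norfolk Southern PR", 0.40)

db.retract_source("Norfolk Southern PR")
# the "contained" claim is gone; the contamination claim keeps two sources
```

The key design point the sketch illustrates: retraction targets a source, not a fact, so corroborated knowledge is never collateral damage.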

See It Work

Watch knowledge build up from multiple sources — then see what happens when one turns out to be wrong.

Norfolk Southern company press release
“The situation is contained after controlled burn” 0.40
Residents on X social media · photos
Dead fish in Sulphur Run creek — waterway contamination 0.55
AP News / Ohio DNR wire service · state agency
3,500 fish killed — contamination confirmed 0.90 CORROBORATED
RETRACT Norfolk Southern press release
“Contained” claim removed. Contamination claims survive — independently corroborated by state agency and residents.
Result: Contamination confirmed by 2 independent sources. Company self-report retracted with full audit trail. Every claim traces to who said it.
Friend Marcus personal recommendation
“Sal's Pizza is excellent — best in the neighborhood” 0.85
Yelp platform rating
3 stars — “Average” 0.60
Coworker Dana first-hand experience
Got food poisoning last month 0.90 CONTRADICTS
RETRACT Yelp — rating manipulation
Yelp rating removed. Marcus and Dana's first-hand claims both survive — contradictory, but both are real experiences.
Result: Two real opinions remain. Marcus says great, Dana says dangerous. Both traceable. The database doesn't hide the disagreement.
Alice team Slack channel
“Bob knows React — he helped me with a component last year” 0.70
Bob's resume self-reported document
Skills: React, TypeScript, Node.js 0.80 CORROBORATED
GitHub PRs last 90 days · actual code
47 PRs — all Python, zero React 0.95 CONTRADICTS
RETRACT resume — 3 years outdated
Resume claim removed. Alice's Slack message survives at single-source (0.70). GitHub evidence (0.95) tells the real story.
Result: Code evidence (0.95) vs. hearsay (0.70). The contradiction is visible, not hidden. You see exactly why the data disagrees.
When a source turns out to be wrong:
Row-based approach
The typical response is to UPDATE or DELETE the row. If downstream reports already used that data, there's no automatic way to trace the impact.
Edge-based approach
The typical response is to remove the edge. Unless provenance metadata was manually maintained, tracing what depended on it requires custom logic.
Embedding-based approach
The typical response is to delete and re-embed. Answers already generated from the old embedding are not automatically traceable.
Substrate
Retract the source. Corroborated facts survive. Full audit trail. The knowledge base heals itself.

Knowledge That Compounds

This is the part that matters most.

Run Substrate for a week and you have structured notes. Run it for six months and you have a reality model — an emergent map of everything your organization knows, where the knowledge is deep, where it's thin, and where different domains connect in ways no single person noticed.

Topics emerge automatically from the claim graph. Your agents extracted 300 claims about your auth system and 4 about data center power capacity. That's not just data — that's a map of where you're knowledgeable and where you're blind. The insight engine finds connections across domain boundaries: auth failures always follow database connection issues, but nobody documented the dependency.

Two years of accumulated evidence can't be speed-run by a competitor. The topology gets richer. Cross-domain connections surface. The organization that started earlier has an advantage that compounds daily and is nearly impossible to replicate.

Practicalities

pip install substrate-db — single-file database, no server, no infrastructure. Point it at a Slack export, a ChatGPT conversation, or a folder of documents — the built-in extractor (heuristic or LLM-powered, 7 providers supported) pulls out claims with provenance tracing to the exact message. Heuristic mode needs no API keys.
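What heuristic extraction with message-level provenance looks like can be sketched with a crude stdlib example. The real extractor is far richer; the regex, message IDs, and claim shape below are invented purely to show the principle that every extracted claim points back to its exact source message:

```python
import re

# Toy corpus: (message id, text) pairs from a chat export.
messages = [
    ("slack/#ops/1042", "Pretty sure Tuesday deploys cause more incidents"),
    ("slack/#ops/1043", "lunch anyone?"),
]

# Crude "X causes Y" heuristic; a real extractor would do much more.
pattern = re.compile(r"(\w[\w\s]*?)\s+causes?\s+([\w\s]+)", re.IGNORECASE)

claims = []
for msg_id, text in messages:
    for m in pattern.finditer(text):
        claims.append({
            "subject": m.group(1).strip(),
            "object": m.group(2).strip(),
            "source": msg_id,   # provenance traces to the exact message
        })
```

Even this toy version keeps the property that matters: no claim exists without a pointer to the message it came from.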

Want a visual interface? pip install substrate-console gives you a browser dashboard that connects to live Slack, Gmail, and Google Docs via OAuth. Ingest your company's knowledge, explore an interactive graph, and ask natural-language questions — all for ~$0.06 on Groq's free tier.

Provenance is required on every write — the engine rejects claims without a source, whether the writer is a human or an agent. Batch mode handles millions of claims via the Rust backend. db.at(timestamp) gives you point-in-time queries — what the agents knew last Tuesday, before the new data came in.
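The point-in-time behavior can be pictured with a small stdlib sketch (a toy model of what db.at(timestamp) does conceptually, not the real substrate-db API; the class and names are invented):

```python
from datetime import datetime

class TemporalStore:
    """Toy append-only store with as-of queries."""

    def __init__(self):
        self._log = []   # (ingested_at, statement), never mutated in place

    def ingest(self, when, statement):
        self._log.append((when, statement))

    def at(self, when):
        # The store as it existed at `when`: only claims ingested by then.
        return [stmt for t, stmt in self._log if t <= when]

store = TemporalStore()
store.ingest(datetime(2024, 3, 5), "payment service fails at 180 connections")
store.ingest(datetime(2024, 3, 12), "Tuesday deploys cause more incidents")

as_of = store.at(datetime(2024, 3, 8))
# only the March 5 claim is visible as of March 8
```

Because the log is append-only, "what did the agents know last Tuesday" is just a filter, not a restore-from-backup.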

For AI agents: the built-in MCP server lets Claude and other MCP-compatible agents read and write Substrate directly. A REST API at /api/v1/ serves any HTTP client. Event hooks (db.on("claim_ingested", callback)) let you build reactive pipelines that trigger when knowledge changes.
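The event-hook pattern can be sketched with a minimal stdlib model (again a conceptual illustration of the db.on("claim_ingested", callback) idea, not the real API; all names below are invented):

```python
class HookedStore:
    """Toy store that fires registered callbacks on ingest."""

    def __init__(self):
        self._hooks = {}
        self.claims = []

    def on(self, event, callback):
        self._hooks.setdefault(event, []).append(callback)

    def ingest(self, statement, source):
        self.claims.append((statement, source))
        # notify every subscriber that knowledge changed
        for cb in self._hooks.get("claim_ingested", []):
            cb(statement, source)

alerts = []
db = HookedStore()
db.on("claim_ingested", lambda stmt, src: alerts.append(f"{src}: {stmt}"))
db.ingest("Redis failover untested", "postmortem-2024-06")
```

A reactive pipeline is then just a callback that re-queries, re-scores, or pages a human whenever a new claim lands.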

Built For

Research

Scientific Teams

An agent reads 200 papers overnight and extracts findings. It notices your team has deep knowledge on kinase inhibition and almost nothing on the metabolic pathway that might connect to it — a gap no single researcher would see.

Operations

Engineering Teams

Agents ingest 18 months of Slack incident channels and postmortem docs. "What breaks if Redis goes down?" — answered from claims extracted across hundreds of incidents, each traceable to the person who figured it out.

Any Team

Anyone Building Agents

If your agents produce knowledge — from customer calls, experiments, market research, code reviews — Substrate is the database layer that makes it compound instead of evaporate.

Get Started

[Architecture: Slack, Teams, LLM chat, documents, email, databases, and external sources feed into Substrate (extract, store, query, correct, discover), which is exposed through the MCP server, REST API, dashboard, Python SDK, and NDJSON.]
$ pip install substrate-db substrate-console
$ substrate-console my_company.db

Opens a dashboard at localhost:8877. Click Connect Slack — authorize your workspace. Click Connect Google — authorize Gmail, Drive, Docs. Go to Ingest. Pick your channels. Hit go.

No API keys. No OAuth apps to create. No environment variables. Your data flows directly between your machine and Slack/Google. Full quick start guide →