What "Claim-Native" Means

A traditional database stores data. You INSERT a row and it's there. The database doesn't know if the data is true, where it came from, or what happens if the source turns out to be wrong. Every row is equally "real."

A knowledge graph improves on this by storing relationships — nodes and edges. But it has the same blind spot. Every edge is equally "true." There's no memory of why it's there.

Substrate stores claims. A claim is an assertion with a source attached: who said it, when, and how confident they are. This is a fundamentally different primitive, and it changes what the database can do.

Claims

A claim is the smallest unit of knowledge in Substrate:

db.ingest(
    subject=("api-gateway", "service"),        # What entity
    predicate=("depends_on", "depends_on"),     # What relationship
    object=("redis", "service"),                # To what entity
    provenance={                                # Where this came from
        "source_type": "config_management",
        "source_id": "k8s-manifest-v2.3",
    },
    confidence=1.0,                              # How certain (0.0 - 1.0)
)

This isn't just a labeled edge in a graph. It's a record that says: "The Kubernetes manifest v2.3 asserts that api-gateway depends on redis, with full confidence." The source is part of the data, not metadata.

Claims are immutable. Once recorded, they're never modified. New information creates new claims. If a source is wrong, a retraction creates a tombstone — the original claim is preserved for audit.

Provenance

Every claim must have a source. This isn't a best practice — it's enforced by the engine. Writes without provenance are rejected. This means you can always answer two questions: "Where did we learn this?" and "Should we still trust it?"
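To make the rule concrete, here is a toy sketch of the write path. The `ingest` function below is a stand-in for illustration, not Substrate's actual engine:

```python
def ingest(subject, predicate, obj, provenance=None, confidence=1.0):
    """Toy write path: reject any claim that arrives without provenance."""
    if not provenance or "source_type" not in provenance or "source_id" not in provenance:
        raise ValueError("claim rejected: provenance with source_type and source_id is required")
    return {"subject": subject, "predicate": predicate, "object": obj,
            "provenance": provenance, "confidence": confidence}

# A sourced write succeeds...
claim = ingest(("api-gateway", "service"), "depends_on", ("redis", "service"),
               provenance={"source_type": "config_management",
                           "source_id": "k8s-manifest-v2.3"})

# ...an unsourced one is rejected, not silently stored.
try:
    ingest(("api-gateway", "service"), "depends_on", ("redis", "service"))
except ValueError as e:
    print(e)  # claim rejected: provenance with source_type and source_id is required
```

Because the check runs at write time, unsourced facts can never enter the log in the first place.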

Source types describe the kind of source:

| Source Type | Examples |
|---|---|
| config_management | Kubernetes manifests, Terraform configs, org charts |
| chat_extraction | ChatGPT conversations, Claude sessions, Slack threads |
| experiment_log | ML experiment results, A/B test outcomes |
| monitoring | Datadog, PagerDuty, Prometheus alerts |
| database_import | Bulk imports from external databases |
| human_annotation | Manual entries by domain experts |
| experimental | Lab results, assay data |
| literature_extraction | Findings from papers and documents |
| clinical_trial | Clinical study results |

The source_id identifies the specific source: a paper DOI, a K8s manifest version, a Slack channel and thread, an experiment run ID. Combined with the source type, this gives you a complete audit trail for every fact in the database.

Corroboration

When the same fact shows up from multiple independent sources, that's a stronger signal than a single source saying it. Substrate tracks this automatically.

# A Kubernetes manifest says api-gateway depends on Redis
claim_a = db.ingest(..., provenance={"source_id": "k8s-manifest", ...})

# An incident response chat independently confirms it
claim_b = db.ingest(..., provenance={"source_id": "chat:incident-42:turn:0", ...})

# Both claims point to the same fact — corroboration is tracked
group = db.claims_by_content_id(claim_a.content_id)
print(f"{len(group)} independent sources confirm this")

This is why claim-native matters. In a traditional database, you'd have two rows with the same data — a duplicate. In Substrate, you have one fact with two sources — corroboration. The difference becomes critical during retraction.
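One way to picture the mechanism, as a toy model rather than Substrate's actual content addressing: hash the fact itself, ignoring provenance, so claims about the same fact from different sources collide into one group.

```python
import hashlib
from collections import defaultdict

def content_id(subject, predicate, obj):
    """Hash only the fact, not its provenance, so identical facts share an id."""
    return hashlib.sha256(repr((subject, predicate, obj)).encode()).hexdigest()[:12]

claims = [
    {"fact": (("api-gateway", "service"), "depends_on", ("redis", "service")),
     "source_id": "k8s-manifest"},
    {"fact": (("api-gateway", "service"), "depends_on", ("redis", "service")),
     "source_id": "chat:incident-42:turn:0"},
]

# Group claims by the fact they assert, keeping each source visible.
groups = defaultdict(list)
for c in claims:
    groups[content_id(*c["fact"])].append(c["source_id"])

for cid, sources in groups.items():
    print(cid, f"{len(sources)} independent sources")  # one fact, two sources
```

Storing the two entries as separate rows would look like duplication; grouping by content gives one fact with a source count.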

Retraction and Self-Correction

Sources can be wrong. Runbooks go stale, papers get retracted, configs change. In a traditional database, you'd delete the bad data and hope nothing depended on it.

Substrate handles this structurally:

# A runbook turns out to be outdated
cascade = db.retract_cascade("runbook-redis-v1", reason="Outdated procedure")
print(f"Retracted: {cascade.source_retract.retracted_count}")
print(f"Downstream degraded: {cascade.degraded_count}")

Nothing is deleted. The original claims are preserved. They're just marked so that queries know to treat them differently. This is what "self-correcting" means — the engine handles the consequences of bad data automatically.
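The tombstone mechanics can be sketched with a toy append-only log (an illustration of the idea, not the real engine): retraction appends a marker instead of deleting, so queries skip retracted claims while the audit trail keeps everything.

```python
log = []  # append-only claim log

def ingest(fact, source_id):
    log.append({"kind": "claim", "fact": fact, "source_id": source_id})

def retract_source(source_id, reason):
    # Append tombstones; never rewrite or delete earlier entries.
    for entry in list(log):
        if entry["kind"] == "claim" and entry["source_id"] == source_id:
            log.append({"kind": "tombstone", "fact": entry["fact"],
                        "source_id": source_id, "reason": reason})

def active_claims():
    """What queries see: claims without a matching tombstone."""
    dead = {(e["fact"], e["source_id"]) for e in log if e["kind"] == "tombstone"}
    return [e for e in log if e["kind"] == "claim"
            and (e["fact"], e["source_id"]) not in dead]

ingest("redis failover procedure", "runbook-redis-v1")
retract_source("runbook-redis-v1", reason="Outdated procedure")

print(len(log))              # 2: the original claim plus its tombstone
print(len(active_claims()))  # 0: queries no longer see it
```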

Time Travel

Every claim carries a timestamp. You can query the knowledge base as it existed at any point in the past:

import time

# What did we know yesterday?
yesterday = time.time_ns() - 86_400 * 10**9
snapshot = db.at(yesterday)
frame = snapshot.query("api-gateway", depth=1)

This is possible because claims are immutable and append-only. New knowledge doesn't overwrite old knowledge — it layers on top. "What was known about the auth service when we decided to migrate it?" is a query, not a forensic investigation.
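Because the log is append-only, "as of time t" is just a filter over timestamps. A toy sketch of the idea:

```python
import time

claims = []  # append-only: (timestamp_ns, fact)

def ingest(fact, ts_ns=None):
    claims.append((ts_ns if ts_ns is not None else time.time_ns(), fact))

def at(ts_ns):
    """Snapshot = every claim recorded at or before ts_ns. Nothing is rewritten."""
    return [fact for recorded, fact in claims if recorded <= ts_ns]

day_ns = 86_400 * 10**9
now = time.time_ns()
ingest("auth-service depends_on postgres", ts_ns=now - 2 * day_ns)
ingest("auth-service depends_on cockroachdb", ts_ns=now)  # migration landed today

print(at(now - day_ns))  # yesterday's view: only the postgres dependency
print(at(now))           # today's view: both claims, layered
```

New knowledge changes what later snapshots contain; it never changes what earlier snapshots return.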

The Extraction Pipeline

Most knowledge isn't structured. It's in conversations, documents, and Slack threads. Substrate has a built-in extraction pipeline that turns unstructured text into claims:

  1. Parse — Break the input into messages or sections
  2. Group — Pair user questions with assistant answers
  3. Extract — Identify structured claims in the text
  4. Curate — Filter contradictions and low-quality claims
  5. Ingest — Store each claim with provenance tracing back to the source conversation and turn

Every extracted claim carries its provenance: which conversation, which turn, which extraction method. You can always trace back to the original text.

The Extract step supports three modes:

| Mode | API Key? | When to use |
|---|---|---|
| "heuristic" | No | Explicit relational text ("X depends on Y"). Fast and free. |
| "llm" | Yes | Nuanced or implicit relationships. 7 providers supported. |
| "smart" | Yes | Large volumes. Heuristic first, LLM only for new content. Saves cost. |
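The heuristic mode's job can be pictured with a toy pattern matcher. The real extractor is far more capable, but the shape is the same: find explicit relational text and emit claims with provenance back to the source turn.

```python
import re

# Toy pattern: only catches explicit "X depends on Y" phrasing.
PATTERN = re.compile(r"(\w[\w-]*) depends on (\w[\w-]*)")

def extract_heuristic(text, source_id):
    """Turn explicit relational statements into sourced claims."""
    return [
        {"subject": m.group(1), "predicate": "depends_on", "object": m.group(2),
         "provenance": {"source_type": "chat_extraction", "source_id": source_id}}
        for m in PATTERN.finditer(text)
    ]

turn = "Heads up: api-gateway depends on redis, and billing depends on postgres."
for claim in extract_heuristic(turn, source_id="chat:incident-42:turn:0"):
    print(claim["subject"], "->", claim["object"])
```

Implicit phrasing ("billing falls over whenever postgres restarts") slips past a matcher like this, which is what the LLM modes are for.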

Vocabularies

A vocabulary tells Substrate what kinds of entities and relationships exist in your domain. It enforces type constraints — so a service can depends_on another service, but not a feature.

| Vocabulary | Entity Types | Relationships | Domain |
|---|---|---|---|
| bio | gene, protein, compound, disease, pathway, ... | binds, inhibits, treats, associated_with, ... | Biomedical research |
| devops | service, incident, alert, team, runbook, ... | depends_on, triggers, monitors, owns, ... | Infrastructure |
| ml | model, dataset, feature, experiment, ... | trained_on, outperforms, uses_feature, ... | ML experiments |

You can register multiple vocabularies on the same database, or define your own.
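Type enforcement amounts to a whitelist of (subject type, relationship, object type) triples. Here is a toy custom vocabulary to show the shape of the constraint, not Substrate's registration API:

```python
# A toy vocabulary: which relationships are legal between which entity types.
VOCAB = {
    ("service", "depends_on", "service"),
    ("team", "owns", "service"),
    ("alert", "triggers", "incident"),
}

def check(subject_type, predicate, object_type):
    """Reject claims whose types don't fit the vocabulary."""
    if (subject_type, predicate, object_type) not in VOCAB:
        raise TypeError(f"{subject_type} -{predicate}-> {object_type} is not in the vocabulary")

check("service", "depends_on", "service")      # fine
try:
    check("service", "depends_on", "feature")  # wrong object type
except TypeError as e:
    print(e)
```

Catching type errors at ingest keeps malformed relationships out of the graph entirely, rather than surfacing them later at query time.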

The Derived Graph

Substrate produces a graph — entities connected by relationships, traversable with query() and path_exists(). But the graph is derived from claims, not primary. Every edge traces back to one or more claims, and every claim traces back to a source.

If you retract all claims from a source, the edges that depended solely on that source disappear. Edges with independent corroboration survive. The graph is a view of the claim log — always consistent, always auditable.
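That derivation rule fits in a few lines as a toy model (a sketch of the idea, not Substrate's internals): an edge exists only while at least one live claim supports it.

```python
claims = [
    # (subject, object, source_id), each asserting a depends_on edge
    ("api-gateway", "redis", "k8s-manifest"),
    ("api-gateway", "redis", "chat:incident-42"),    # corroborated edge
    ("api-gateway", "legacy-cache", "k8s-manifest"), # single-source edge
]
retracted_sources = set()

def edges():
    """Derive the graph: an edge survives while any non-retracted claim backs it."""
    live = {}
    for subj, obj, src in claims:
        if src not in retracted_sources:
            live.setdefault((subj, obj), []).append(src)
    return live

print(sorted(edges()))  # both edges present
retracted_sources.add("k8s-manifest")
print(sorted(edges()))  # corroborated edge survives; single-source edge is gone
```

Retracting the manifest removes the edge it alone supported, while the corroborated edge keeps its remaining source: the graph stays consistent with the claim log by construction.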