Install

$ pip install substrate-db

Optional extras:

$ pip install "substrate-db[llm]"       # LLM-powered extraction (7 providers)
$ pip install "substrate-db[science]"   # Community detection, eval tools
$ pip install "substrate-db[all]"       # Everything

1. Create a Knowledge Base

One line to set up a database with a domain vocabulary and curator:

import substrate

db = substrate.quickstart("my_team.db", vocabs=["devops"], curator="heuristic")

Available vocabularies: "bio" (biomedical research), "devops" (infrastructure), "ml" (experiments), or "all". Each vocabulary defines which entity types and relationships are valid for your domain.
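Conceptually, a vocabulary is an allow-list of entity types and typed predicate patterns. A minimal sketch of that idea (the dict layout and `is_valid_claim` helper are illustrative assumptions, not Substrate's internal representation):

```python
# Illustrative only: a "devops"-style vocabulary as an allow-list of
# entity types plus (subject type, object type) patterns per predicate.
DEVOPS_VOCAB = {
    "entity_types": {"service", "team", "alert"},
    "predicates": {
        # predicate: (allowed subject type, allowed object type)
        "depends_on": ("service", "service"),
        "owns": ("team", "service"),
        "monitors": ("alert", "service"),
    },
}

def is_valid_claim(subject_type, predicate, object_type, vocab=DEVOPS_VOCAB):
    """Return True if the claim fits the vocabulary's type constraints."""
    pattern = vocab["predicates"].get(predicate)
    return pattern == (subject_type, object_type)

print(is_valid_claim("service", "depends_on", "service"))  # True
print(is_valid_claim("team", "depends_on", "service"))     # False: teams don't depend_on
```

Claims that fall outside the vocabulary's patterns are what the curator is there to reject.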

2. Add Claims

Directly — with provenance

Every fact needs a source. The engine won't accept a claim without one.

db.ingest(
    subject=("api-gateway", "service"),
    predicate=("depends_on", "depends_on"),
    object=("redis", "service"),
    provenance={"source_type": "config_management", "source_id": "k8s-manifest-v2.3"},
    confidence=1.0,
)

From a conversation

Substrate extracts claims automatically and traces each one to the conversation turn it came from.

result = db.ingest_chat([
    {"role": "user", "content": "What depends on the auth service?"},
    {"role": "assistant", "content": (
        "API Gateway depends on Auth Service for request validation. "
        "The mobile app depends on Auth Service for OAuth token exchange."
    )},
], conversation_id="architecture-review", extraction="heuristic")

print(f"Extracted {result.claims_ingested} claims")

From a Slack export

# Export from Slack admin settings, then:
results = db.ingest_slack("team_slack_export.zip", extraction="heuristic")
for r in results:
    print(f"  {r.conversation_id}: {r.claims_ingested} claims extracted")

From a ChatGPT export

# Settings → Data controls → Export data in ChatGPT, then:
results = db.ingest_chat_file("conversations.zip")
for r in results:
    print(r.summary)

In batch

from substrate import ClaimInput

claims = [
    ClaimInput(
        subject=("api-gateway", "service"),
        predicate=("depends_on", "depends_on"),
        object=("redis", "service"),
        provenance={"source_type": "config_management", "source_id": "k8s"},
    ),
    # ... more claims, each with its own source
]
result = db.ingest_batch(claims)

3. Query What You Know

frame = db.query("auth-service", depth=2)

print(f"{frame.focal_entity.name}: {frame.claim_count} claims")
for rel in frame.direct_relationships:
    print(f"  {rel.target.name} --[{rel.predicate}]--> {frame.focal_entity.name}")

# Human-readable summary of everything known about this entity
print(frame.narrative)

The query result includes every relationship, the confidence score for each (based on source quality and corroboration), and a narrative summary.
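Substrate's exact scoring is internal, but the intuition behind "source quality and corroboration" can be sketched with a noisy-OR combination, where each independent source shrinks the remaining doubt (the function name and weights here are assumptions for illustration, not the library's formula):

```python
def combined_confidence(source_scores):
    """Noisy-OR sketch: each independent source in [0, 1] multiplies
    down the remaining doubt, so corroboration raises confidence."""
    doubt = 1.0
    for s in source_scores:
        doubt *= (1.0 - s)
    return 1.0 - doubt

# One mediocre source vs. the same claim corroborated three times
print(round(combined_confidence([0.6]), 3))            # 0.6
print(round(combined_confidence([0.6, 0.6, 0.6]), 3))  # 0.936
```

The practical takeaway: a fact asserted once by a weak source stays weak, while the same fact seen in several independent sources approaches certainty.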

4. Explore, Verify, and Correct

See what the knowledge base contains

schema = db.schema()
print(schema)
# Entity types: service(12), team(4), alert(3)
# Predicates: depends_on(18), owns(6), monitors(4)
# Relationship patterns:
#   service --[depends_on]--> service  (18x)
#   team --[owns]--> service  (6x)

Check corroboration

# How many independent sources confirm this fact?
group = db.claims_by_content_id(some_claim.content_id)
print(f"{len(group)} sources corroborate this")

Retract a bad source

# An old runbook turns out to be wrong
db.retract_cascade("runbook-redis-v1", reason="Outdated procedure")
# Claims from this source are retracted
# Downstream dependents are marked as degraded
# Facts with independent corroboration survive
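The survival rule above can be modeled in a few lines: a claim is retracted only when the bad source was its sole support. This toy model uses made-up claim IDs and is not Substrate's implementation:

```python
# Toy model of cascade retraction: each claim lists its sources;
# a fact survives if at least one untainted source remains.
claims = {
    "redis-dep":  {"sources": {"runbook-redis-v1", "k8s-manifest-v2.3"}},
    "flush-step": {"sources": {"runbook-redis-v1"}},
}

def retract_source(claims, bad_source):
    surviving, retracted = {}, []
    for cid, claim in claims.items():
        remaining = claim["sources"] - {bad_source}
        if remaining:
            surviving[cid] = {"sources": remaining}
        else:
            retracted.append(cid)
    return surviving, retracted

surviving, retracted = retract_source(claims, "runbook-redis-v1")
print(retracted)          # only the single-source claim falls
print(sorted(surviving))  # the corroborated fact survives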

Find knowledge gaps

gaps = db.find_gaps(
    {"service": {"depends_on", "monitors"}},
    min_claims=1,
)
for g in gaps:
    print(f"  {g.entity_id}: missing {g.missing_predicate_types}")
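The gap check boils down to set difference: for each entity of a type, subtract the predicates it has claims for from the predicates the spec expects. A self-contained sketch with toy data (not Substrate's internals):

```python
# Expected predicates per entity type, and what we've actually observed.
expected = {"service": {"depends_on", "monitors"}}
observed = {
    "api-gateway": {"type": "service", "predicates": {"depends_on"}},
    "redis":       {"type": "service", "predicates": set()},
}

def find_gaps(observed, expected):
    gaps = []
    for entity_id, info in observed.items():
        want = expected.get(info["type"], set())
        missing = want - info["predicates"]
        if missing:
            gaps.append((entity_id, sorted(missing)))
    return gaps

for entity_id, missing in find_gaps(observed, expected):
    print(f"{entity_id}: missing {missing}")
```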

Track research questions

# Register a question
db.ingest_inquiry(
    question="Does the payment service depend on Redis?",
    subject=("payment-service", "service"),
    object=("redis", "service"),
    predicate_hint="depends_on",
)

# Later, when new claims arrive, check for matches
matches = db.check_inquiry_matches(subject_id="payment-service")

Extraction Modes

Mode          Needs API Key?   Best For
"heuristic"   No               Text with explicit relational statements. Fast and free.
"llm"         Yes              Nuanced text, implicit relationships. 7 providers supported.
"smart"       Yes              Large volumes. Heuristic first, LLM only for novel content. Saves tokens.

To use LLM extraction, set the appropriate environment variable (GROQ_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.) and configure the curator:

db.configure_curator("groq")  # Free tier available
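The token savings of "smart" mode come from routing: always run the cheap heuristic, escalate to the LLM only for content not seen before. A runnable sketch of that routing idea, using content hashing for novelty (illustrative only; not Substrate's actual implementation):

```python
import hashlib

def smart_extract(text, seen_hashes, heuristic, llm):
    """Sketch of 'smart' routing: cheap heuristic always runs;
    the expensive LLM call fires only for novel content."""
    claims = heuristic(text)
    digest = hashlib.sha256(text.encode()).hexdigest()
    if digest not in seen_hashes:
        seen_hashes.add(digest)
        claims += llm(text)  # expensive call, novel content only
    return claims

seen = set()
heuristic = lambda t: [("heuristic", t)]
llm = lambda t: [("llm", t)]
print(len(smart_extract("A depends on B", seen, heuristic, llm)))  # 2: both extractors ran
print(len(smart_extract("A depends on B", seen, heuristic, llm)))  # 1: LLM skipped on repeat
```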

Backend Options

db = substrate.open("my.db")                   # Kuzu (default)
db = substrate.open("my.db", backend="rust")   # Rust (high throughput)
db = substrate.open("my.db", backend="auto")   # Try Rust, fall back to Kuzu

Both backends have the same API and produce identical results. The Rust backend is faster for bulk ingestion and large datasets.

Next Steps

Cookbooks

Full working examples for biomedical research, DevOps, and ML teams.

API Reference

Every method, parameter, and return type.

Core Concepts

What claim-native means and why provenance changes everything.