```bash
$ pip install substrate-db
```

Optional extras:

```bash
$ pip install substrate-db[llm]      # LLM-powered extraction (7 providers)
$ pip install substrate-db[science]  # Community detection, eval tools
$ pip install substrate-db[all]      # Everything
```
One line to set up a database with a domain vocabulary and curator:
```python
import substrate

db = substrate.quickstart("my_team.db", vocabs=["devops"], curator="heuristic")
```
Available vocabularies: "bio" (biomedical research), "devops" (infrastructure), "ml" (experiments), or "all". Each vocabulary defines which entity types and relationships are valid for your domain.
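As an illustration of what a vocabulary constrains, here is a toy model (not substrate's internal representation) using the entity types and predicates that appear in the schema examples below:

```python
# Toy model of a domain vocabulary: allowed entity types plus allowed
# (subject_type, predicate, object_type) patterns.
# Illustrative only -- not substrate's internal representation.
DEVOPS_VOCAB = {
    "entity_types": {"service", "team", "alert"},
    "patterns": {
        ("service", "depends_on", "service"),
        ("team", "owns", "service"),
        ("alert", "monitors", "service"),
    },
}

def is_valid_claim(vocab, subject_type, predicate, object_type):
    """A claim is valid when its type pattern appears in the vocabulary."""
    return (subject_type, predicate, object_type) in vocab["patterns"]

print(is_valid_claim(DEVOPS_VOCAB, "service", "depends_on", "service"))  # True
print(is_valid_claim(DEVOPS_VOCAB, "team", "depends_on", "alert"))       # False
```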
Every fact needs a source. The engine won't accept a claim without one.
```python
db.ingest(
    subject=("api-gateway", "service"),
    predicate=("depends_on", "depends_on"),
    object=("redis", "service"),
    provenance={"source_type": "config_management", "source_id": "k8s-manifest-v2.3"},
    confidence=1.0,
)
```
Substrate extracts claims automatically and traces each one to the conversation turn it came from.
```python
result = db.ingest_chat(
    [
        {"role": "user", "content": "What depends on the auth service?"},
        {
            "role": "assistant",
            "content": (
                "API Gateway depends on Auth Service for request validation. "
                "The mobile app depends on Auth Service for OAuth token exchange."
            ),
        },
    ],
    conversation_id="architecture-review",
    extraction="heuristic",
)
print(f"Extracted {result.claims_ingested} claims")
```
```python
# Export from Slack admin settings, then:
results = db.ingest_slack("team_slack_export.zip", extraction="heuristic")
for r in results:
    print(f"  {r.conversation_id}: {r.claims_ingested} claims extracted")
```
```python
# Settings → Data controls → Export data in ChatGPT, then:
results = db.ingest_chat_file("conversations.zip")
for r in results:
    print(r.summary)
```
```python
from substrate import ClaimInput

claims = [
    ClaimInput(
        subject=("api-gateway", "service"),
        predicate=("depends_on", "depends_on"),
        object=("redis", "service"),
        provenance={"source_type": "config_management", "source_id": "k8s"},
    ),
    # ... more claims, each with its own source
]
result = db.ingest_batch(claims)
```
```python
frame = db.query("auth-service", depth=2)
print(f"{frame.focal_entity.name}: {frame.claim_count} claims")
for rel in frame.direct_relationships:
    print(f"  {rel.target.name} --[{rel.predicate}]--> {frame.focal_entity.name}")

# Human-readable summary of everything known about this entity
print(frame.narrative)
```
The query result includes every relationship, a confidence score for each (based on source quality and corroboration), and a narrative summary.
```python
schema = db.schema()
print(schema)
# Entity types: service(12), team(4), alert(3)
# Predicates: depends_on(18), owns(6), monitors(4)
# Relationship patterns:
#   service --[depends_on]--> service (18x)
#   team --[owns]--> service (6x)
```
```python
# How many independent sources confirm this fact?
group = db.claims_by_content_id(some_claim.content_id)
print(f"{len(group)} sources corroborate this")
```
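To sketch the idea behind a content ID — the same fact reported by different sources should group together — here is an illustrative stand-in (the hashing scheme and normalization are assumptions, not substrate's actual implementation):

```python
import hashlib

def content_id(subject, predicate, obj):
    # Illustrative: a stable hash of the claim's normalized content, so the
    # same fact from different sources maps to the same ID. The real scheme
    # inside substrate may differ.
    raw = "|".join([subject, predicate, obj]).lower()
    return hashlib.sha256(raw.encode()).hexdigest()[:12]

a = content_id("api-gateway", "depends_on", "redis")
b = content_id("API-Gateway", "depends_on", "Redis")
print(a == b)  # True: same content despite different casing
```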
```python
# An old runbook turns out to be wrong
db.retract_cascade("runbook-redis-v1", reason="Outdated procedure")
# Claims from this source are retracted
# Downstream dependents are marked as degraded
# Facts with independent corroboration survive
```
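The cascade semantics described in the comments above can be sketched with a toy model — purely an illustration of the behavior, not substrate's implementation:

```python
# Toy model of cascade retraction: a claim whose only source is retracted
# disappears; claims depending on a retracted claim are degraded; claims
# with an independent corroborating source survive.
claims = {
    "c1": {"sources": {"runbook-redis-v1"}, "depends_on": set()},
    "c2": {"sources": {"runbook-redis-v1", "k8s-manifest"}, "depends_on": set()},
    "c3": {"sources": {"wiki"}, "depends_on": {"c1"}},  # downstream of c1
}

def retract_cascade(claims, source):
    retracted = {
        cid for cid, c in claims.items()
        if c["sources"] == {source}          # sole source gone -> retract
    }
    degraded = {
        cid for cid, c in claims.items()
        if cid not in retracted and (c["depends_on"] & retracted)
    }
    return retracted, degraded

retracted, degraded = retract_cascade(claims, "runbook-redis-v1")
print(retracted)  # {'c1'}: only source was the bad runbook
print(degraded)   # {'c3'}: depended on c1; c2 survives via corroboration
```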
```python
gaps = db.find_gaps(
    {"service": {"depends_on", "monitors"}},
    min_claims=1,
)
for g in gaps:
    print(f"  {g.entity_id}: missing {g.missing_predicate_types}")
```
```python
# Register a question
db.ingest_inquiry(
    question="Does the payment service depend on Redis?",
    subject=("payment-service", "service"),
    object=("redis", "service"),
    predicate_hint="depends_on",
)

# Later, when new claims arrive, check for matches
matches = db.check_inquiry_matches(subject_id="payment-service")
```
| Mode | Needs API Key? | Best For |
|---|---|---|
| `"heuristic"` | No | Text with explicit relational statements. Fast and free. |
| `"llm"` | Yes | Nuanced text, implicit relationships. 7 providers supported. |
| `"smart"` | Yes | Large volumes. Heuristic first, LLM only for novel content. Saves tokens. |
To use LLM extraction, set the appropriate environment variable (`GROQ_API_KEY`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.) and configure the curator:
```python
db.configure_curator("groq")  # Free tier available
```
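A hedged sketch of choosing an extraction mode based on which provider keys are configured — the mode strings and key names come from the table and prose above, but the selection helper itself is illustrative, not part of substrate:

```python
import os

def pick_extraction_mode(env=os.environ):
    """Illustrative helper: fall back to the free heuristic mode when no
    LLM provider key is configured. Key names match the providers listed
    above; this function is not part of substrate's API."""
    llm_keys = ("GROQ_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY")
    if any(env.get(k) for k in llm_keys):
        # "smart" saves tokens on large volumes; use "llm" for small,
        # nuanced batches where every message should hit the model.
        return "smart"
    return "heuristic"

print(pick_extraction_mode(env={}))  # heuristic
```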
```python
db = substrate.open("my.db")                  # Kuzu (default)
db = substrate.open("my.db", backend="rust")  # Rust (high throughput)
db = substrate.open("my.db", backend="auto")  # Try Rust, fall back to Kuzu
```
Both backends have the same API and produce identical results. The Rust backend is faster for bulk ingestion and large datasets.
Full working examples for biomedical research, DevOps, and ML teams.
Every method, parameter, and return type.
What claim-native means and why provenance changes everything.