```bash
$ pip install substrate-db
```

Optional extras:

```bash
$ pip install substrate-db[llm]      # LLM-powered extraction (7 providers)
$ pip install substrate-db[science]  # Community detection, eval tools
$ pip install substrate-db[all]      # Everything
```
One line to set up a database with a domain vocabulary and curator:
```python
import substrate

db = substrate.quickstart("my_team.db", vocabs=["devops"], curator="heuristic")
```
Available vocabularies: "bio" (biomedical research), "devops" (infrastructure), "ml" (experiments), or "all". Each vocabulary defines which entity types and relationships are valid for your domain.
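As an illustration of what a vocabulary constrains, here is a toy model (not substrate's internal representation) using the entity types and predicates that appear in the schema examples below:

```python
# Toy model of a domain vocabulary: allowed entity types plus allowed
# (subject_type, predicate, object_type) patterns.
# Illustrative only -- not substrate's internal representation.
DEVOPS_VOCAB = {
    "entity_types": {"service", "team", "alert"},
    "patterns": {
        ("service", "depends_on", "service"),
        ("team", "owns", "service"),
        ("alert", "monitors", "service"),
    },
}

def is_valid_claim(vocab, subject_type, predicate, object_type):
    """A claim is valid when its type pattern appears in the vocabulary."""
    return (subject_type, predicate, object_type) in vocab["patterns"]

print(is_valid_claim(DEVOPS_VOCAB, "service", "depends_on", "service"))  # True
print(is_valid_claim(DEVOPS_VOCAB, "team", "depends_on", "alert"))       # False
```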
Every fact needs a source. The engine won't accept a claim without one.
```python
db.ingest(
    subject=("api-gateway", "service"),
    predicate=("depends_on", "depends_on"),
    object=("redis", "service"),
    provenance={"source_type": "config_management", "source_id": "k8s-manifest-v2.3"},
    confidence=1.0,
)
```
Substrate extracts claims automatically and traces each one to the conversation turn it came from.
```python
result = db.ingest_chat(
    [
        {"role": "user", "content": "What depends on the auth service?"},
        {
            "role": "assistant",
            "content": (
                "API Gateway depends on Auth Service for request validation. "
                "The mobile app depends on Auth Service for OAuth token exchange."
            ),
        },
    ],
    conversation_id="architecture-review",
    extraction="heuristic",
)
print(f"Extracted {result.claims_ingested} claims")
```
```python
# Export from Slack admin settings, then:
results = db.ingest_slack("team_slack_export.zip", extraction="heuristic")
for r in results:
    print(f"  {r.conversation_id}: {r.claims_ingested} claims extracted")
```
```python
# Settings → Data controls → Export data in ChatGPT, then:
results = db.ingest_chat_file("conversations.zip")
for r in results:
    print(r.summary)
```
```python
from substrate import ClaimInput

claims = [
    ClaimInput(
        subject=("api-gateway", "service"),
        predicate=("depends_on", "depends_on"),
        object=("redis", "service"),
        provenance={"source_type": "config_management", "source_id": "k8s"},
    ),
    # ... more claims, each with its own source
]
result = db.ingest_batch(claims)
```
```python
frame = db.query("auth-service", depth=2)
print(f"{frame.focal_entity.name}: {frame.claim_count} claims")
for rel in frame.direct_relationships:
    print(f"  {rel.target.name} --[{rel.predicate}]--> {frame.focal_entity.name}")

# Human-readable summary of everything known about this entity
print(frame.narrative)
```
The query result includes every relationship, a confidence score for each (based on source quality and corroboration), and a narrative summary.
```python
schema = db.schema()
print(schema)
# Entity types: service(12), team(4), alert(3)
# Predicates: depends_on(18), owns(6), monitors(4)
# Relationship patterns:
#   service --[depends_on]--> service (18x)
#   team --[owns]--> service (6x)
```
```python
# How many independent sources confirm this fact?
group = db.claims_by_content_id(some_claim.content_id)
print(f"{len(group)} sources corroborate this")
```
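To sketch the idea behind a content ID — the same fact reported by different sources should group together — here is an illustrative stand-in (the hashing scheme and normalization are assumptions, not substrate's actual implementation):

```python
import hashlib

def content_id(subject, predicate, obj):
    # Illustrative: a stable hash of the claim's normalized content, so the
    # same fact from different sources maps to the same ID. The real scheme
    # inside substrate may differ.
    raw = "|".join([subject, predicate, obj]).lower()
    return hashlib.sha256(raw.encode()).hexdigest()[:12]

a = content_id("api-gateway", "depends_on", "redis")
b = content_id("API-Gateway", "depends_on", "Redis")
print(a == b)  # True: same content despite different casing
```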
```python
# An old runbook turns out to be wrong
db.retract_cascade("runbook-redis-v1", reason="Outdated procedure")
# Claims from this source are retracted
# Downstream dependents are marked as degraded
# Facts with independent corroboration survive
```
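The cascade semantics described in the comments above can be sketched with a toy model — purely an illustration of the behavior, not substrate's implementation:

```python
# Toy model of cascade retraction: a claim whose only source is retracted
# disappears; claims depending on a retracted claim are degraded; claims
# with an independent corroborating source survive.
claims = {
    "c1": {"sources": {"runbook-redis-v1"}, "depends_on": set()},
    "c2": {"sources": {"runbook-redis-v1", "k8s-manifest"}, "depends_on": set()},
    "c3": {"sources": {"wiki"}, "depends_on": {"c1"}},  # downstream of c1
}

def retract_cascade(claims, source):
    retracted = {
        cid for cid, c in claims.items()
        if c["sources"] == {source}          # sole source gone -> retract
    }
    degraded = {
        cid for cid, c in claims.items()
        if cid not in retracted and (c["depends_on"] & retracted)
    }
    return retracted, degraded

retracted, degraded = retract_cascade(claims, "runbook-redis-v1")
print(retracted)  # {'c1'}: only source was the bad runbook
print(degraded)   # {'c3'}: depended on c1; c2 survives via corroboration
```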
```python
gaps = db.find_gaps(
    {"service": {"depends_on", "monitors"}},
    min_claims=1,
)
for g in gaps:
    print(f"  {g.entity_id}: missing {g.missing_predicate_types}")
```
```python
# Register a question
db.ingest_inquiry(
    question="Does the payment service depend on Redis?",
    subject=("payment-service", "service"),
    object=("redis", "service"),
    predicate_hint="depends_on",
)

# Later, when new claims arrive, check for matches
matches = db.check_inquiry_matches(subject_id="payment-service")
```
| Mode | Needs API Key? | Best For |
|---|---|---|
| `"heuristic"` | No | Text with explicit relational statements. Fast and free. |
| `"llm"` | Yes | Nuanced text, implicit relationships. 7 providers supported. |
| `"smart"` | Yes | Large volumes. Heuristic first, LLM only for novel content. Saves tokens. |
To use LLM extraction, set the appropriate environment variable (`GROQ_API_KEY`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.) and configure the curator:
```python
db.configure_curator("groq")  # Free tier available
```
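A hedged sketch of choosing an extraction mode based on which provider keys are configured — the mode strings and key names come from the table and prose above, but the selection helper itself is illustrative, not part of substrate:

```python
import os

def pick_extraction_mode(env=os.environ):
    """Illustrative helper: fall back to the free heuristic mode when no
    LLM provider key is configured. Key names match the providers listed
    above; this function is not part of substrate's API."""
    llm_keys = ("GROQ_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY")
    if any(env.get(k) for k in llm_keys):
        # "smart" saves tokens on large volumes; use "llm" for small,
        # nuanced batches where every message should hit the model.
        return "smart"
    return "heuristic"

print(pick_extraction_mode(env={}))  # heuristic
```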
```python
db = substrate.open("my.db")                  # Kuzu (default)
db = substrate.open("my.db", backend="rust")  # Rust (high throughput)
db = substrate.open("my.db", backend="auto")  # Try Rust, fall back to Kuzu
```
Both backends have the same API and produce identical results. The Rust backend is faster for bulk ingestion and large datasets.
Full working examples for biomedical research, DevOps, and ML teams.
Every method, parameter, and return type.
What claim-native means and why provenance changes everything.