Cookbooks — Substrate

Org

Organization Knowledge Console

Your company's most valuable knowledge lives in Slack threads, email chains, and Google Docs that nobody can find six months later. This cookbook shows how to connect those sources via OAuth, ingest everything into a Substrate knowledge base, and explore it from a browser dashboard.

What it covers

Install & launch the Substrate Console web dashboard
Connect Slack workspace via OAuth (read-only bot token)
Connect Gmail & Google Docs via OAuth (read-only, with refresh token)
Smart extraction: heuristic pre-scan, LLM only for novel content (~$0.06 for a full workspace on Groq)
SSE progress streaming as claims are extracted in real time
Interactive graph visualization with Cytoscape.js (people, teams, projects, decisions, tools)
Natural-language Ask: "Who is responsible for the auth migration?" — answered with claim evidence
Quality dashboard: health score, knowledge gaps, bridge predictions, confidence alerts
Source retraction: disconnect or retract a source, corroborated facts survive

Quick start

$ pip install substrate-console
$ substrate-console my_company.db

This opens http://localhost:8877 in your browser. From there:

Sources → Connect Slack (one click, OAuth V2)
Sources → Connect Google (Gmail + Docs, OAuth 2.0)
Ingest → Select source, pick extraction mode, click Start
Entities → Search and browse everything extracted
Graph → Interactive network visualization
Ask → Type a question, get a grounded answer with evidence

How it works under the hood

# The console does this for you, but you can also do it from Python:
import substrate
from substrate_console.org_vocabulary import register_org_vocabulary

db = substrate.quickstart("my_company.db")
register_org_vocabulary(db)
db.configure_curator("groq")  # Free tier, ~$0.05/M tokens

# Ingest a Slack export
results = db.ingest_slack("team_export.zip", extraction="smart")
for r in results:
    print(f"  #{r.conversation_id}: {r.claims_ingested} claims")

# Or ingest text directly (from an email, a doc, anything)
db.ingest_text(
    "The auth team decided to migrate from JWT to session tokens. "
    "Sarah proposed this and the platform team approved it.",
    source_id="eng-all-hands-2024-02",
)

# Query what you know
frame = db.query("auth-service", depth=2)
print(frame.narrative)

# Find who knows what
for rel in frame.direct_relationships:
    print(f"  {rel.predicate} → {rel.target.name} ({rel.n_independent_sources} sources)")

# Ask natural-language questions
# The Ask page does entity extraction + context assembly + LLM synthesis
# "Who is responsible for the auth migration?"
# → sarah proposed auth-migration, platform team approved auth-migration
#   (2 claims, sources: slack #eng-decisions, eng-all-hands-2024-02)

Org vocabulary

The console registers an organizational vocabulary with entity types and predicates designed for company knowledge:

Entity Types	Predicate Types
`person`, `team`, `channel`, `document`, `email_thread`	`authored`, `mentioned`, `decided`, `discussed_in`
`project`, `decision`, `tool`, `process`, `meeting`	`responsible_for`, `member_of`, `reports_to`, `depends_on`
	`related_to`, `proposed`, `approved`, `uses`

Cost model

Smart extraction on Groq (free tier, ~$0.05/M tokens):

Source	Items	Smart Skip	Est. Cost
10K Slack messages	10,000	60%	$0.02
2K email threads	2,000	50%	$0.025
50 Google Docs	50	30%	$0.004
100 NL queries	100	—	$0.01
Total			$0.06

Heuristic mode (no LLM) is completely free. Good for structured text like meeting notes and project updates.

Dashboard pages

Dashboard

Stats overview: entity/claim counts, source breakdown, health score, quick actions.

Entity Explorer

Search with HTMX typeahead, filter by type and min claims, click through to detail pages.

Graph

Full-screen Cytoscape.js canvas. Node size = claims, edge thickness = confidence. Focus on any entity.

Quality

Knowledge health: multi-source %, confidence alerts, bridge predictions, knowledge gaps.

Ingestion

Pick a source, choose extraction mode, watch SSE progress. Job history with costs.

Ask

Natural-language questions with grounded answers. Evidence panel shows exact claims used.

Biology

Biomedical Research Team

A cancer biology lab uses ChatGPT and Slack daily. This cookbook captures claims from published literature, ChatGPT research sessions, and Slack channel discussions into a unified knowledge base. Then it discovers novel drug targets, finds knowledge gaps, and tracks research questions.

What it covers

Batch ingest structured claims from PubMed, PDB, Reactome, and clinical trials
Corroboration — same fact confirmed by independent sources
Extract claims from a ChatGPT research session on triple-negative breast cancer
Import Slack channels (brca-project, journal-club) from a workspace export
Unified query: BRCA1 aggregates literature + chat + Slack sources
Bridge prediction for drug repurposing candidates
Research question tracking with inquiry matching
Source retraction with provenance cascade

Key code

# One-line setup
db = substrate.quickstart("cancer_lab.db", vocabs=["bio"], curator="heuristic")

# Ingest from multiple sources
db.ingest_batch(literature_claims)                           # PubMed, PDB, Reactome
db.ingest_chat(chatgpt_session, extraction="heuristic")      # ChatGPT conversation
db.ingest_slack("lab_slack.zip", extraction="heuristic")     # Slack workspace

# Query unified knowledge
frame = db.query("BRCA1", depth=2)
print(frame.narrative)

# Discover hidden connections
db.generate_structural_embeddings(dim=32)
bridges = db.find_bridges(top_k=10)

# Track research questions
db.ingest_inquiry(question="Can Olaparib treat breast cancer?",
    subject=("Olaparib", "compound"), object=("Breast Cancer", "disease"))

Sample output

  Knowledge base: 21 entities, 29 claims
  BRCA1: 24 claims, 23 relationships
    --[associated_with]--> Breast Cancer (disease, conf=0.91)
    --[binds]--> RAD51 (protein, conf=0.85)
    --[involved_in]--> DNA Repair Pathway (pathway, conf=0.70)
  Bridge predictions:
    olaparib <-> talazoparib (similarity=0.816)
  Total runtime: 1.98s

Copy full runnable script

#!/usr/bin/env python3
"""Biomedical Research Cookbook — No API keys required. Runs in ~2 seconds."""

from __future__ import annotations
import json, logging, os, sys, tempfile, time, zipfile
logging.disable(logging.WARNING)
import substrate
from substrate import ClaimInput

def section(title):
    print(f"\n{'─'*60}\n  {title}\n{'─'*60}\n")

def main():
    start = time.perf_counter()
    with tempfile.TemporaryDirectory() as tmp:
        db = substrate.quickstart(os.path.join(tmp, "cancer_lab"), vocabs=["bio"], curator="heuristic")
        section("Ingest published literature")
        literature_claims = [
            ClaimInput(subject=("BRCA1","gene"), predicate=("associated_with","associated_with"),
                       object=("Breast Cancer","disease"), provenance={"source_type":"database_import","source_id":"pmid:20301425"}, confidence=0.95),
            ClaimInput(subject=("BRCA2","gene"), predicate=("associated_with","associated_with"),
                       object=("Breast Cancer","disease"), provenance={"source_type":"database_import","source_id":"pmid:20301425"}, confidence=0.93),
            ClaimInput(subject=("BRCA1","gene"), predicate=("associated_with","associated_with"),
                       object=("Ovarian Cancer","disease"), provenance={"source_type":"database_import","source_id":"pmid:24677121"}, confidence=0.91),
            ClaimInput(subject=("BRCA1","protein"), predicate=("binds","binds"),
                       object=("RAD51","protein"), provenance={"source_type":"experimental","source_id":"pdb:1n0w"}, confidence=0.92),
            ClaimInput(subject=("TP53","protein"), predicate=("interacts","interacts"),
                       object=("BRCA1","protein"), provenance={"source_type":"experimental","source_id":"pmid:19837678"}, confidence=0.88),
            ClaimInput(subject=("BRCA1","gene"), predicate=("involved_in","involved_in"),
                       object=("DNA Repair Pathway","pathway"), provenance={"source_type":"database_import","source_id":"reactome:R-HSA-73894"}, confidence=0.94),
            ClaimInput(subject=("RAD51","protein"), predicate=("involved_in","involved_in"),
                       object=("Homologous Recombination","pathway"), provenance={"source_type":"database_import","source_id":"reactome:R-HSA-5693532"}, confidence=0.96),
            ClaimInput(subject=("Tamoxifen","compound"), predicate=("treats","treats"),
                       object=("Breast Cancer","disease"), provenance={"source_type":"clinical_trial","source_id":"nct:00003140"}, confidence=0.97),
            ClaimInput(subject=("Olaparib","compound"), predicate=("treats","treats"),
                       object=("Ovarian Cancer","disease"), provenance={"source_type":"clinical_trial","source_id":"nct:01874353"}, confidence=0.89),
            ClaimInput(subject=("Olaparib","compound"), predicate=("inhibits","inhibits"),
                       object=("PARP1","protein"), provenance={"source_type":"experimental","source_id":"pmid:16912195"}, confidence=0.95),
            ClaimInput(subject=("PARP1","protein"), predicate=("involved_in","involved_in"),
                       object=("DNA Repair Pathway","pathway"), provenance={"source_type":"database_import","source_id":"reactome:R-HSA-73894"}, confidence=0.93),
            ClaimInput(subject=("Tamoxifen","compound"), predicate=("inhibits","inhibits"),
                       object=("Estrogen Receptor","protein"), provenance={"source_type":"experimental","source_id":"pmid:15928335"}, confidence=0.93),
            ClaimInput(subject=("Estrogen Receptor","protein"), predicate=("associated_with","associated_with"),
                       object=("Breast Cancer","disease"), provenance={"source_type":"experimental","source_id":"pmid:18202748"}, confidence=0.94),
        ]
        batch = db.ingest_batch(literature_claims)
        print(f"  Ingested {batch.ingested} claims from literature")
        # Corroboration
        db.ingest(subject=("BRCA1","gene"), predicate=("associated_with","associated_with"),
                  object=("Breast Cancer","disease"), provenance={"source_type":"database_import","source_id":"disgenet:C0006142"}, confidence=0.90)
        print(f"  Corroboration: {len(db.claims_by_content_id(db.claims_for('brca1')[0].content_id))} sources confirm BRCA1 ~ Breast Cancer")

        section("Extract from ChatGPT")
        result = db.ingest_chat([
            {"role":"user","content":"What genes are mutated in triple-negative breast cancer?"},
            {"role":"assistant","content":"BRCA1 is associated with Triple Negative Breast Cancer. TP53 is associated with Triple Negative Breast Cancer. PIK3CA is associated with Triple Negative Breast Cancer. PTEN is associated with Triple Negative Breast Cancer."},
            {"role":"user","content":"Treatment options for BRCA-mutated TNBC?"},
            {"role":"assistant","content":"Olaparib treats Triple Negative Breast Cancer. Talazoparib treats Triple Negative Breast Cancer. Talazoparib inhibits PARP1. Carboplatin treats Triple Negative Breast Cancer. Pembrolizumab treats Triple Negative Breast Cancer."},
        ], conversation_id="tnbc-research", extraction="heuristic")
        print(f"  {result.claims_ingested} claims from ChatGPT session")

        section("Import Slack channels")
        slack_zip = os.path.join(tmp, "lab_slack.zip")
        with zipfile.ZipFile(slack_zip, "w") as zf:
            zf.writestr("channels.json", json.dumps([{"id":"C1","name":"brca-project"},{"id":"C2","name":"journal-club"}]))
            zf.writestr("users.json", json.dumps([{"id":"U1","name":"sarah","profile":{"real_name":"Sarah"}}]))
            zf.writestr("brca-project/2024-01-20.json", json.dumps([
                {"type":"message","user":"U1","text":"PALB2 connection to BRCA2?","ts":"1705700000.000"},
                {"type":"message","bot_id":"B1","text":"PALB2 binds BRCA2. PALB2 is associated with Breast Cancer. PALB2 is involved in Homologous Recombination.","ts":"1705700060.000"},
            ]))
            zf.writestr("journal-club/2024-01-22.json", json.dumps([
                {"type":"message","user":"U1","text":"ATM inhibitor paper","ts":"1705900000.000"},
                {"type":"message","bot_id":"B1","text":"ATM is involved in DNA Repair Pathway. AZD0156 inhibits ATM. ATM interacts with BRCA1.","ts":"1705900060.000"},
            ]))
        slack_results = db.ingest_slack(slack_zip, extraction="heuristic")
        print(f"  {sum(r.claims_ingested for r in slack_results)} claims from Slack")

        section("Query unified knowledge")
        stats = db.stats()
        print(f"  {stats['entity_count']} entities, {stats['total_claims']} claims")
        frame = db.query("BRCA1", depth=2)
        print(f"  BRCA1: {frame.claim_count} claims, {len(frame.direct_relationships)} relationships")
        for rel in frame.direct_relationships[:8]:
            print(f"    --[{rel.predicate}]--> {rel.target.name} ({rel.target.entity_type}, conf={rel.confidence:.2f})")

        section("Discover hidden connections")
        db.generate_structural_embeddings(dim=32)
        for b in db.find_bridges(top_k=5):
            print(f"    {b.entity_a} <-> {b.entity_b} (similarity={b.similarity:.3f})")
        gaps = db.find_gaps({"gene":{"associated_with","involved_in"},"compound":{"treats","inhibits"}}, min_claims=1)
        print(f"  {len(gaps)} knowledge gaps found")

        section("Research questions")
        db.ingest_inquiry(question="Can Olaparib treat breast cancer?", subject=("Olaparib","compound"), object=("Breast Cancer","disease"), predicate_hint="treats")
        db.ingest(subject=("Olaparib","compound"), predicate=("treats","treats"), object=("Breast Cancer","disease"),
                  provenance={"source_type":"clinical_trial","source_id":"nct:02000622"}, confidence=0.88)
        print(f"  Inquiry matches: {len(db.check_inquiry_matches(subject_id='Olaparib', object_id='Breast Cancer'))}")

        section("Source retraction")
        cascade = db.retract_cascade("pmid:20301425", reason="Data fabrication")
        print(f"  Retracted: {cascade.source_retract.retracted_count} claims")
        print(f"  Downstream degraded: {cascade.degraded_count}")
        active = [c for c in db.claims_for("brca1") if c.status.name == "ACTIVE"]
        print(f"  BRCA1 still has {len(active)} active claims (corroboration preserved)")
        db.close()
    print(f"\n  Total runtime: {time.perf_counter() - start:.2f}s")
    return 0

if __name__ == "__main__":
    sys.exit(main())

DevOps

DevOps Knowledge Base

An infrastructure team models service dependencies from Kubernetes manifests, captures incident response claims from ChatGPT debugging sessions, and pulls tribal knowledge from Slack ops channels. Query "what depends on Redis?" — answered with every source that says so.

What it covers

Model architecture from K8s manifests (dependencies, ownership, monitoring)
Extract from an incident response ChatGPT session (Redis outage blast radius)
Import ops-incidents and architecture Slack channels
Dependency query: "What depends on Redis Cache?"
Impact analysis: "Who gets paged if Postgres goes down?"
Path analysis between monitoring and databases
Post-incident research question tracking

Key code

db = substrate.quickstart("infra.db", vocabs=["devops"], curator="heuristic")

# Model architecture
db.ingest_batch(architecture_claims)     # K8s manifests, org chart, monitoring
db.ingest_chat(incident_chat, ...)       # Post-incident ChatGPT sessions
db.ingest_slack("ops_slack.zip", ...)     # Slack ops channels

# Impact analysis
redis_frame = db.query("Redis Cache", depth=2)
has_path = db.path_exists("PagerDuty Alert", "PostgreSQL", max_depth=3)

Copy full runnable script

#!/usr/bin/env python3
"""DevOps Knowledge Base Cookbook — No API keys required. Runs in ~2 seconds."""

from __future__ import annotations
import json, logging, os, sys, tempfile, time, zipfile
logging.disable(logging.WARNING)
import substrate
from substrate import ClaimInput

def section(title):
    print(f"\n{'─'*60}\n  {title}\n{'─'*60}\n")

def main():
    start = time.perf_counter()
    with tempfile.TemporaryDirectory() as tmp:
        db = substrate.quickstart(os.path.join(tmp, "infra_kb"), vocabs=["devops"], curator="heuristic")

        section("Model service architecture")
        batch = db.ingest_batch([
            ClaimInput(subject=("API Gateway","service"), predicate=("depends_on","depends_on"), object=("Auth Service","service"),
                       provenance={"source_type":"config_management","source_id":"k8s:api-gateway"}, confidence=1.0),
            ClaimInput(subject=("API Gateway","service"), predicate=("depends_on","depends_on"), object=("User Service","service"),
                       provenance={"source_type":"config_management","source_id":"k8s:api-gateway"}, confidence=1.0),
            ClaimInput(subject=("API Gateway","service"), predicate=("depends_on","depends_on"), object=("Redis Cache","service"),
                       provenance={"source_type":"config_management","source_id":"k8s:api-gateway"}, confidence=1.0),
            ClaimInput(subject=("Auth Service","service"), predicate=("depends_on","depends_on"), object=("PostgreSQL","service"),
                       provenance={"source_type":"config_management","source_id":"k8s:auth-service"}, confidence=1.0),
            ClaimInput(subject=("Auth Service","service"), predicate=("depends_on","depends_on"), object=("Redis Cache","service"),
                       provenance={"source_type":"config_management","source_id":"k8s:auth-service"}, confidence=1.0),
            ClaimInput(subject=("User Service","service"), predicate=("depends_on","depends_on"), object=("PostgreSQL","service"),
                       provenance={"source_type":"config_management","source_id":"k8s:user-service"}, confidence=1.0),
            ClaimInput(subject=("Datadog Agent","service"), predicate=("monitors","monitors"), object=("API Gateway","service"),
                       provenance={"source_type":"monitoring","source_id":"datadog:config"}, confidence=0.95),
            ClaimInput(subject=("Datadog Agent","service"), predicate=("monitors","monitors"), object=("PostgreSQL","service"),
                       provenance={"source_type":"monitoring","source_id":"datadog:config"}, confidence=0.95),
            ClaimInput(subject=("PagerDuty Alert","alert"), predicate=("monitors","monitors"), object=("Redis Cache","service"),
                       provenance={"source_type":"monitoring","source_id":"pagerduty:redis-alert"}, confidence=0.9),
            ClaimInput(subject=("Platform Team","team"), predicate=("owns","owns"), object=("API Gateway","service"),
                       provenance={"source_type":"config_management","source_id":"org-chart"}, confidence=1.0),
            ClaimInput(subject=("Platform Team","team"), predicate=("owns","owns"), object=("Auth Service","service"),
                       provenance={"source_type":"config_management","source_id":"org-chart"}, confidence=1.0),
            ClaimInput(subject=("Data Team","team"), predicate=("owns","owns"), object=("PostgreSQL","service"),
                       provenance={"source_type":"config_management","source_id":"org-chart"}, confidence=1.0),
            ClaimInput(subject=("Platform Team","team"), predicate=("owns","owns"), object=("Redis Cache","service"),
                       provenance={"source_type":"config_management","source_id":"org-chart"}, confidence=1.0),
        ])
        print(f"  {batch.ingested} architecture claims")

        section("Incident response chat")
        result = db.ingest_chat([
            {"role":"user","content":"Redis Cache outage at 3am. API Gateway returning 502s. Blast radius?"},
            {"role":"assistant","content":"API Gateway depends on Redis Cache for session caching. Auth Service depends on Redis Cache for token storage. PagerDuty Alert monitors Redis Cache. The runbook RB-REDIS-001 mitigates Redis Cache outages."},
            {"role":"user","content":"How to prevent this?"},
            {"role":"assistant","content":"Redis Sentinel monitors Redis Cache for failover. Platform Team owns Redis Cache. Datadog Agent monitors API Gateway. Consider a circuit breaker in API Gateway that falls back to PostgreSQL."},
        ], conversation_id="inc-2024-redis-outage", extraction="heuristic")
        print(f"  {result.claims_ingested} claims from incident chat")

        section("Slack ops channels")
        slack_zip = os.path.join(tmp, "ops_slack.zip")
        with zipfile.ZipFile(slack_zip, "w") as zf:
            zf.writestr("channels.json", json.dumps([{"id":"C1","name":"ops-incidents"},{"id":"C2","name":"architecture"}]))
            zf.writestr("users.json", json.dumps([{"id":"U1","name":"eng","profile":{"real_name":"Engineer"}}]))
            zf.writestr("ops-incidents/2024-02-01.json", json.dumps([
                {"type":"message","user":"U1","text":"PostgreSQL connection limits again","ts":"1706800000.000"},
                {"type":"message","bot_id":"B1","text":"Auth Service depends on PostgreSQL. User Service depends on PostgreSQL. Data Team owns PostgreSQL.","ts":"1706800060.000"},
            ]))
            zf.writestr("architecture/2024-02-05.json", json.dumps([
                {"type":"message","user":"U1","text":"Redis to KeyDB migration impact?","ts":"1707100000.000"},
                {"type":"message","bot_id":"B1","text":"Redis Cache is used by API Gateway for rate limiting. Auth Service depends on Redis Cache for JWT blacklisting.","ts":"1707100060.000"},
            ]))
        slack_results = db.ingest_slack(slack_zip, extraction="heuristic")
        print(f"  {sum(r.claims_ingested for r in slack_results)} claims from Slack")

        section("What depends on Redis?")
        redis_frame = db.query("Redis Cache", depth=2)
        print(f"  Redis Cache: {redis_frame.claim_count} claims")
        for rel in redis_frame.direct_relationships:
            print(f"    {rel.predicate} <-- {rel.target.name}")

        section("Postgres blast radius")
        pg_frame = db.query("PostgreSQL", depth=2)
        deps = [r for r in pg_frame.direct_relationships if r.predicate == "depends_on"]
        print(f"  {len(deps)} services depend on PostgreSQL")
        print(f"  PagerDuty -> PostgreSQL path: {db.path_exists('PagerDuty Alert', 'PostgreSQL', max_depth=3)}")

        section("Post-incident questions")
        db.ingest_inquiry(question="Does Redis Sentinel monitor Redis Cache?",
                          subject=("Redis Sentinel","service"), object=("Redis Cache","service"), predicate_hint="monitors")
        print(f"  Open inquiries: {len(db.open_inquiries())}")

        final = db.stats()
        print(f"\n  Final: {final['entity_count']} entities, {final['total_claims']} claims")
        db.close()
    print(f"  Runtime: {time.perf_counter() - start:.2f}s")
    return 0

if __name__ == "__main__":
    sys.exit(main())

ML Experiment Tracker

An ML team tracks experiments, model comparisons, and feature engineering decisions. Register experiment results as claims with provenance, extract insights from model discussion chats, and query "what was tried for churn prediction?" — every answer traces back to its source.

What it covers

Register experiments: models, datasets, features, performance comparisons
Extract from a model comparison ChatGPT session (XGBoost vs LightGBM vs Neural Net)
Import Slack #ml-experiments channel with ensemble results
Query "What was tried for churn prediction?" — shows all models trained on dataset
Feature importance across experiments
Model lineage tracing (XGBoost V2 derived from V1)
Knowledge gap detection (models missing evaluated_on)

Key code

db = substrate.quickstart("ml.db", vocabs=["ml"], curator="heuristic")

# Track experiments with performance payloads
db.ingest(
    subject=("XGBoost V2", "model"),
    predicate=("trained_on", "trained_on"),
    object=("Churn Dataset Q4", "dataset"),
    provenance={"source_type": "experiment_log", "source_id": "exp:003"},
    payload={"accuracy": 0.87, "auc": 0.90},
)

# Query what was tried
frame = db.query("Churn Dataset Q4", depth=2)

# Model lineage
claims = db.claims_for("xgboost v2", predicate_type="derived_from")

Copy full runnable script

#!/usr/bin/env python3
"""ML Experiment Tracker Cookbook — No API keys required. Runs in ~2 seconds."""

from __future__ import annotations
import json, logging, os, sys, tempfile, time, zipfile
logging.disable(logging.WARNING)
import substrate
from substrate import ClaimInput

def section(title):
    print(f"\n{'─'*60}\n  {title}\n{'─'*60}\n")

def main():
    start = time.perf_counter()
    with tempfile.TemporaryDirectory() as tmp:
        db = substrate.quickstart(os.path.join(tmp, "ml_tracker"), vocabs=["ml"], curator="heuristic")

        section("Register experiments")
        batch = db.ingest_batch([
            ClaimInput(subject=("XGBoost V1","model"), predicate=("trained_on","trained_on"), object=("Churn Dataset Q4","dataset"),
                       provenance={"source_type":"experiment_log","source_id":"exp:churn-001"}, confidence=1.0, payload={"accuracy":0.82,"auc":0.85}),
            ClaimInput(subject=("XGBoost V1","model"), predicate=("uses_feature","uses_feature"), object=("Tenure Months","feature"),
                       provenance={"source_type":"experiment_log","source_id":"exp:churn-001"}, confidence=0.95),
            ClaimInput(subject=("XGBoost V1","model"), predicate=("uses_feature","uses_feature"), object=("Monthly Charges","feature"),
                       provenance={"source_type":"experiment_log","source_id":"exp:churn-001"}, confidence=0.92),
            ClaimInput(subject=("XGBoost V1","model"), predicate=("uses_feature","uses_feature"), object=("Contract Type","feature"),
                       provenance={"source_type":"experiment_log","source_id":"exp:churn-001"}, confidence=0.88),
            ClaimInput(subject=("Random Forest V1","model"), predicate=("trained_on","trained_on"), object=("Churn Dataset Q4","dataset"),
                       provenance={"source_type":"experiment_log","source_id":"exp:churn-002"}, confidence=1.0, payload={"accuracy":0.79,"auc":0.81}),
            ClaimInput(subject=("XGBoost V1","model"), predicate=("outperforms","outperforms"), object=("Random Forest V1","model"),
                       provenance={"source_type":"experiment_log","source_id":"exp:churn-compare-001"}, confidence=0.88),
            ClaimInput(subject=("XGBoost V2","model"), predicate=("trained_on","trained_on"), object=("Churn Dataset Q4","dataset"),
                       provenance={"source_type":"experiment_log","source_id":"exp:churn-003"}, confidence=1.0, payload={"accuracy":0.87,"auc":0.90}),
            ClaimInput(subject=("XGBoost V2","model"), predicate=("derived_from","derived_from"), object=("XGBoost V1","model"),
                       provenance={"source_type":"experiment_log","source_id":"exp:churn-003"}, confidence=1.0),
            ClaimInput(subject=("XGBoost V2","model"), predicate=("uses_feature","uses_feature"), object=("Support Tickets 30D","feature"),
                       provenance={"source_type":"experiment_log","source_id":"exp:churn-003"}, confidence=0.96),
            ClaimInput(subject=("XGBoost V2","model"), predicate=("uses_feature","uses_feature"), object=("NPS Score","feature"),
                       provenance={"source_type":"experiment_log","source_id":"exp:churn-003"}, confidence=0.91),
            ClaimInput(subject=("XGBoost V2","model"), predicate=("outperforms","outperforms"), object=("XGBoost V1","model"),
                       provenance={"source_type":"experiment_log","source_id":"exp:churn-compare-002"}, confidence=0.92),
            ClaimInput(subject=("Neural Net V1","model"), predicate=("trained_on","trained_on"), object=("Fraud Dataset 2024","dataset"),
                       provenance={"source_type":"experiment_log","source_id":"exp:fraud-001"}, confidence=1.0, payload={"accuracy":0.94,"f1":0.72}),
            ClaimInput(subject=("Neural Net V1","model"), predicate=("uses_feature","uses_feature"), object=("Transaction Amount","feature"),
                       provenance={"source_type":"experiment_log","source_id":"exp:fraud-001"}, confidence=0.97),
            ClaimInput(subject=("Neural Net V1","model"), predicate=("uses_feature","uses_feature"), object=("Merchant Category","feature"),
                       provenance={"source_type":"experiment_log","source_id":"exp:fraud-001"}, confidence=0.85),
            ClaimInput(subject=("Fraud Dataset 2024","dataset"), predicate=("derived_from","derived_from"), object=("Fraud Dataset 2023","dataset"),
                       provenance={"source_type":"experiment_log","source_id":"data:fraud-v2"}, confidence=1.0),
        ])
        print(f"  {batch.ingested} experiment claims")

        section("ChatGPT model discussion")
        result = db.ingest_chat([
            {"role":"user","content":"XGBoost V2 is at 0.90 AUC. Should we try LightGBM?"},
            {"role":"assistant","content":"XGBoost V2 outperforms Random Forest V1. LightGBM Churn V1 was trained on Churn Dataset Q4 with 0.88 AUC. XGBoost V2 outperforms LightGBM Churn V1. Key features: Support Tickets 30D and NPS Score."},
            {"role":"user","content":"What about fraud detection?"},
            {"role":"assistant","content":"Neural Net V1 was evaluated on Fraud Dataset 2024 with 0.72 F1. Isolation Forest V1 was trained on Fraud Dataset 2024 as baseline. Neural Net V1 outperforms Isolation Forest V1 on precision."},
        ], conversation_id="ml-review-weekly", extraction="heuristic")
        print(f"  {result.claims_ingested} claims from chat")

        section("Slack #ml-experiments")
        slack_zip = os.path.join(tmp, "ml_slack.zip")
        with zipfile.ZipFile(slack_zip, "w") as zf:
            zf.writestr("channels.json", json.dumps([{"id":"C1","name":"ml-experiments"}]))
            zf.writestr("users.json", json.dumps([{"id":"U1","name":"ds","profile":{"real_name":"Data Scientist"}}]))
            zf.writestr("ml-experiments/2024-03-01.json", json.dumps([
                {"type":"message","user":"U1","text":"Trying ensemble for churn","ts":"1709300000.000"},
                {"type":"message","bot_id":"B1","text":"Stacked Ensemble V1 was trained on Churn Dataset Q4. Stacked Ensemble V1 outperforms XGBoost V2 with 0.92 AUC. Stacked Ensemble V1 uses feature Contract Type. Stacked Ensemble V1 was derived from XGBoost V2.","ts":"1709300060.000"},
            ]))
        slack_results = db.ingest_slack(slack_zip, extraction="heuristic")
        print(f"  {sum(r.claims_ingested for r in slack_results)} claims from Slack")

        section("What was tried for churn prediction?")
        churn_frame = db.query("Churn Dataset Q4", depth=2)
        print(f"  Models trained on Churn Dataset Q4:")
        for rel in churn_frame.direct_relationships:
            if rel.predicate == "trained_on":
                print(f"    {rel.target.name}")

        section("Feature importance")
        for feat in db.list_entities(entity_type="feature"):
            claims = db.claims_for(feat.id, predicate_type="uses_feature")
            if claims:
                models = set(c.subject.display_name or c.subject.id for c in claims)
                print(f"    {feat.name}: {len(models)} model(s)")

        section("Model lineage")
        for c in db.claims_for("xgboost v2", predicate_type="derived_from"):
            print(f"    XGBoost V2 derived from {c.object.display_name or c.object.id}")

        final = db.stats()
        print(f"\n  Final: {final['entity_count']} entities, {final['total_claims']} claims")
        db.close()
    print(f"  Runtime: {time.perf_counter() - start:.2f}s")
    return 0

if __name__ == "__main__":
    sys.exit(main())

Four domains, same engine

Company Knowledge

Drug Repurposing

Incident Knowledge

Experiment Lineage

Organization Knowledge Console

What it covers

Quick start

How it works under the hood

Org vocabulary

Cost model

Dashboard pages

Dashboard

Entity Explorer

Graph

Quality

Ingestion

Ask

Biomedical Research Team

What it covers

Key code

Sample output

DevOps Knowledge Base

What it covers

Key code

ML Experiment Tracker

What it covers

Key code