Benchmarks

All numbers on this page come from automated tests that run in CI. You can reproduce every benchmark yourself — the test files, datasets, and Rust harness are all in the repo.

Engine Performance

Substrate ships two backends with the same API. The Kuzu backend is the default and works everywhere. The Rust backend is built for high-throughput workloads.

| Operation | Kuzu Backend | Rust Backend | Speedup |
|---|---|---|---|
| Claim ingestion | ~36 claims/sec | 1.3M claims/sec | 36,000x |
| Entity query | ~12 ms | 8 µs | 1,500x |
| BFS traversal (depth 2) | ~12 ms | 15 µs | 800x |
| Adjacency list build (1K claims) | — | 223 µs | — |
| ContextFrame assembly (depth 2) | 12 ms | 12 ms | 1x (Python overhead) |
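The "same API, two backends" claim above can be sketched as a shared interface with interchangeable implementations. All names below (`Backend`, `make_backend`, the `ingest` signature) are illustrative assumptions, not Substrate's actual interface:

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """One API, two implementations (illustrative sketch only)."""

    @abstractmethod
    def ingest(self, claims: list[str]) -> int:
        """Ingest claims, return the number stored."""


class KuzuBackend(Backend):
    def ingest(self, claims):
        return len(claims)  # stand-in for the real Kuzu ingest path


class RustBackend(Backend):
    def ingest(self, claims):
        return len(claims)  # stand-in: would call into the Rust extension


def make_backend(name: str) -> Backend:
    """Select a backend by name; callers never branch on the choice again."""
    return {"kuzu": KuzuBackend, "rust": RustBackend}[name]()
```

Because both classes satisfy the same interface, benchmark harnesses and application code can swap backends with a single argument.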

Reproduce it

# Rust microbenchmarks
$ cd rust && cargo bench

# Python performance tests
$ uv run pytest tests/integration/test_performance.py -v

Edge Recovery on Hetionet

Hetionet is a public biomedical knowledge graph with 47,031 nodes and 2.25 million edges across 24 relationship types. We use it to test whether Substrate can predict missing knowledge.

Method

  1. Extract an ego network (~200 entities, ~5,000 edges, avg degree ~37)
  2. Withhold 10% of edges as ground truth
  3. Ingest the remaining 90% into Substrate
  4. Generate structural embeddings (SVD, dim=64)
  5. Score all possible edges using damped random walk on the normalized adjacency matrix
  6. Measure: of the withheld edges, how many appear in the top-K predictions per entity?
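Steps 4–6 above can be sketched in a few lines of NumPy. The damping factor, walk length, and row normalization below are assumptions for illustration, not Substrate's exact parameters:

```python
import numpy as np


def damped_walk_scores(A: np.ndarray, alpha: float = 0.5, steps: int = 3) -> np.ndarray:
    """Score candidate edges with a damped random walk on the
    degree-normalized adjacency matrix: S = sum_k alpha^k * P^k."""
    deg = A.sum(axis=1, keepdims=True)
    P = A / np.maximum(deg, 1)          # row-stochastic transition matrix
    S = np.zeros_like(P)
    Pk = np.eye(len(A))
    for k in range(1, steps + 1):
        Pk = Pk @ P                     # k-step transition probabilities
        S += (alpha ** k) * Pk          # longer walks contribute less
    return S


def recall_at_k(scores: np.ndarray, held_out: list[tuple[int, int]], k: int = 50) -> float:
    """Step 6: fraction of withheld edges that land in the top-k
    predictions for their source entity."""
    hits = 0
    for u, v in held_out:
        top_k = np.argsort(scores[u])[::-1][:k]
        hits += int(v in top_k)
    return hits / len(held_out)
```

The structural embeddings from step 4 would come from a truncated SVD of the same adjacency matrix (e.g. `np.linalg.svd` keeping 64 singular vectors); the walk scores above correspond to step 5.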

Results

| Method | Metric | Result | Target |
|---|---|---|---|
| Unsupervised (SVD + random walk) | Edge recovery recall | 17.35% | >15% |
| 4-signal ensemble | Precision@50 | ≥15% or ≥3 hits | ≥15% or ≥3 hits |
| Supervised stacking classifier | Precision@50 | ≥50% | ≥50% |
| Supervised stacking classifier | Edge recovery recall | ≥25% | ≥25% |

Both backends (Kuzu and Rust) produce identical recall on the same holdout. The supervised scorer uses 18 topological features with a stacking classifier (LR + RF).
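A stacking classifier of the LR + RF shape described above can be sketched with scikit-learn. The hyperparameters are assumptions, and the 18 topological features are not reproduced here:

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression


def make_edge_scorer() -> StackingClassifier:
    """Stack a logistic regression and a random forest over edge
    features, with a logistic regression meta-learner combining their
    out-of-fold predictions. Illustrative configuration only."""
    return StackingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    )
```

Stacking lets the linear model capture smooth monotone features while the forest handles interactions; the meta-learner weighs the two per input.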

Why this matters

This isn't an academic exercise. In the biomedical cookbook, bridge predictions on Hetionet identified drug repurposing candidates — 4 out of 10 top predictions were independently validated in published literature, including a 2025 paper on Dasatinib. The other 6 were genuinely novel (not yet in literature).

Reproduce it

$ uv sync --extra science
$ uv run pytest tests/eval/test_hetionet_holdout.py -m slow -s

Downloads Hetionet (~15MB) on first run. Full eval takes ~9 minutes.

Curator Accuracy

The curator triages incoming claims: store, skip, or flag for review. We test this against a set of 250 expert-labeled claims.

| Metric | Result | Target |
|---|---|---|
| Overall accuracy | 98% | >80% |
| False positive rate | <1% | — |
| False negative rate | <2% | — |

This is the heuristic curator (no LLM). It runs offline, with zero API calls. LLM-backed curators can achieve higher accuracy on nuanced claims but require an API key.
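The store/skip/flag decision can be sketched as a rule cascade. The field names and thresholds below are invented for illustration and are not Substrate's actual heuristics:

```python
def triage(claim: dict) -> str:
    """Heuristic triage sketch (no LLM, no API calls).

    Rules, field names (has_source, contradicts_existing, confidence),
    and the 0.7 threshold are all illustrative assumptions.
    """
    if not claim.get("has_source"):
        return "skip"       # no provenance: the engine never stores these
    if claim.get("contradicts_existing"):
        return "flag"       # conflicts need human review
    if claim.get("confidence", 0.0) >= 0.7:
        return "store"
    return "flag"           # low-confidence claims go to review, not storage
```

Ordering the rules from cheapest and most decisive to most ambiguous is what keeps a curator like this fast enough to run on every incoming claim.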

Reproduce it

$ uv run pytest tests/eval/test_curator_accuracy.py -v

Cross-Language Verification

The Python and Rust backends must produce identical results — same entity IDs, same claim hashes, same content IDs. We verify this with 96 golden test vectors covering entity normalization, claim ID computation, and content ID computation.
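Golden vectors of this kind pin down a canonical byte sequence that both languages must hash identically. The sketch below shows the general shape; the normalization scheme (NFKC + casefold + whitespace collapse) and the field separator are assumptions, not Substrate's documented algorithm:

```python
import hashlib
import unicodedata


def normalize_entity(name: str) -> str:
    """Canonicalize an entity name so Python and Rust agree byte-for-byte.
    NFKC handles compatibility forms, casefold handles Greek and other
    non-ASCII case pairs, and split/join collapses all whitespace."""
    return " ".join(unicodedata.normalize("NFKC", name).casefold().split())


def claim_id(subject: str, predicate: str, obj: str) -> str:
    """SHA-256 over a canonical encoding of the claim triple.
    The unit-separator delimiter (0x1F) is an illustrative choice."""
    canonical = "\x1f".join((normalize_entity(subject), predicate, normalize_entity(obj)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A golden-vector suite then records `(input, expected_hex)` pairs generated from one implementation and asserts the other reproduces them exactly.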

| What's tested | Vectors | Status |
|---|---|---|
| Entity normalization (Unicode, Greek, whitespace) | 32 | Bit-identical |
| Claim ID computation (SHA-256) | 32 | Bit-identical |
| Content ID computation (SHA-256) | 32 | Bit-identical |

Reproduce it

# Generate vectors from Python
$ uv run python scripts/generate_golden_vectors.py

# Verify in Rust
$ cd rust && cargo test

What Only Claim-Native Can Benchmark

Traditional graph database benchmarks (LDBC SNB, etc.) measure query throughput and traversal latency. Those benchmarks don't test the things that make Substrate different, because no other database does them.

Retraction Cascade

Retract a source and every downstream claim is automatically marked as degraded. Corroborated facts survive. Tested end-to-end in test_provenance_cascade.py.
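The cascade semantics can be sketched in a few lines: a claim degrades only when its last live source is retracted. The data shapes here are illustrative, not the engine's internal model:

```python
def retract_source(source_id: str, claims: list[dict]) -> list[dict]:
    """Mark claims degraded when their only remaining source is retracted;
    claims corroborated by another live source survive intact.
    Field names (sources, status) are illustrative assumptions."""
    for claim in claims:
        claim["sources"].discard(source_id)
        if not claim["sources"]:
            claim["status"] = "degraded"
    return claims
```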

Corroboration Tracking

Same fact from two independent sources? The engine tracks it as corroboration, not a duplicate. Verified in test_symmetric.py.
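Dedup-as-corroboration reduces to keying claims by content hash and accumulating sources. A minimal sketch, with invented names:

```python
def record_claim(store: dict, claim_hash: str, source_id: str) -> int:
    """If the hash is new, create one entry; if it already exists,
    add the source as corroboration rather than storing a duplicate.
    Returns the claim's current corroboration count."""
    entry = store.setdefault(claim_hash, {"sources": set()})
    entry["sources"].add(source_id)
    return len(entry["sources"])
```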

Time Travel Queries

Query the knowledge base as it existed at any past timestamp. Append-only claim log makes this free. Tested in test_time_travel.py.
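Why the append-only log makes this free: the state at any timestamp is just a prefix replay. A sketch, assuming a simplified `(timestamp, op, claim_id)` log format that is not the engine's actual schema:

```python
def as_of(log: list[tuple[int, str, str]], ts: int) -> set[str]:
    """Replay an append-only, timestamp-ordered claim log up to ts
    and return the set of claims live at that instant."""
    state: set[str] = set()
    for t, op, claim_id in log:
        if t > ts:
            break               # log is ordered; nothing later matters
        if op == "add":
            state.add(claim_id)
        else:                   # retraction is just another log entry
            state.discard(claim_id)
    return state
```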

Provenance Audit

Every fact traces back to its source. Every extraction traces to the conversation turn. No claim exists without provenance — the engine rejects it.

Test Suite

| Suite | Tests | Runtime |
|---|---|---|
| Python unit + integration | ~290 | ~60 s |
| Rust unit + golden vectors | 46 | ~3 s |
| Eval (Hetionet, curator accuracy) | 6 | ~9 min |

# Run everything except slow eval tests
$ uv run pytest tests/unit/ tests/integration/ -q

# Run Rust tests
$ cd rust && cargo test

# Run full eval suite (slow, downloads data)
$ uv run pytest tests/eval/ -m slow -s