Benchmarks

All numbers on this page come from automated tests that run in CI. You can reproduce every benchmark yourself — the test files, datasets, and Rust harness are all in the repo.

Engine Performance

Substrate ships two backends with the same API. The Kuzu backend is the default and works everywhere. The Rust backend is built for high-throughput workloads.

| Operation | Kuzu Backend | Rust Backend | Speedup |
|---|---|---|---|
| Claim ingestion | ~36 claims/sec | 1.3M claims/sec | 36,000x |
| Entity query | ~12 ms | 8 µs | 1,500x |
| BFS traversal (depth 2) | ~12 ms | 15 µs | 800x |
| Adjacency list build (1K claims) | — | 223 µs | — |
| ContextFrame assembly (depth 2) | 12 ms | 12 ms | 1x (Python overhead) |
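The "same API, two backends" claim above can be sketched as a shared interface with interchangeable implementations. All names below (`Backend`, `make_backend`, the `ingest` signature) are illustrative assumptions, not Substrate's actual interface:

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """One API, two implementations (illustrative sketch only)."""

    @abstractmethod
    def ingest(self, claims: list[str]) -> int:
        """Ingest claims, return the number stored."""


class KuzuBackend(Backend):
    def ingest(self, claims):
        return len(claims)  # stand-in for the real Kuzu ingest path


class RustBackend(Backend):
    def ingest(self, claims):
        return len(claims)  # stand-in: would call into the Rust extension


def make_backend(name: str) -> Backend:
    """Select a backend by name; callers never branch on the choice again."""
    return {"kuzu": KuzuBackend, "rust": RustBackend}[name]()
```

Because both classes satisfy the same interface, benchmark harnesses and application code can swap backends with a single argument.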

Reproduce it

# Rust microbenchmarks
$ cd rust && cargo bench

# Python performance tests
$ uv run pytest tests/integration/test_performance.py -v

Edge Recovery on Hetionet

Hetionet is a public biomedical knowledge graph with 47,031 nodes and 2.25 million edges across 24 relationship types. We use it to test whether Substrate can predict missing knowledge.

Method

  1. Extract an ego network (~200 entities, ~5,000 edges, avg degree ~37)
  2. Withhold 10% of edges as ground truth
  3. Ingest the remaining 90% into Substrate
  4. Generate structural embeddings (SVD, dim=64)
  5. Score all possible edges using damped random walk on the normalized adjacency matrix
  6. Measure: of the withheld edges, how many appear in the top-K predictions per entity?
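Steps 4–6 above can be sketched in a few lines of NumPy. The damping factor, walk length, and row normalization below are assumptions for illustration, not Substrate's exact parameters:

```python
import numpy as np


def damped_walk_scores(A: np.ndarray, alpha: float = 0.5, steps: int = 3) -> np.ndarray:
    """Score candidate edges with a damped random walk on the
    degree-normalized adjacency matrix: S = sum_k alpha^k * P^k."""
    deg = A.sum(axis=1, keepdims=True)
    P = A / np.maximum(deg, 1)          # row-stochastic transition matrix
    S = np.zeros_like(P)
    Pk = np.eye(len(A))
    for k in range(1, steps + 1):
        Pk = Pk @ P                     # k-step transition probabilities
        S += (alpha ** k) * Pk          # longer walks contribute less
    return S


def recall_at_k(scores: np.ndarray, held_out: list[tuple[int, int]], k: int = 50) -> float:
    """Step 6: fraction of withheld edges that land in the top-k
    predictions for their source entity."""
    hits = 0
    for u, v in held_out:
        top_k = np.argsort(scores[u])[::-1][:k]
        hits += int(v in top_k)
    return hits / len(held_out)
```

The structural embeddings from step 4 would come from a truncated SVD of the same adjacency matrix (e.g. `np.linalg.svd` keeping 64 singular vectors); the walk scores above correspond to step 5.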

Results

| Method | Metric | Result | Target |
|---|---|---|---|
| Unsupervised (SVD + random walk) | Edge recovery recall | 17.35% | >15% |
| 4-signal ensemble | Precision@50 | ≥15% or ≥3 hits | ≥15% or ≥3 hits |
| Supervised stacking classifier | Precision@50 | ≥50% | ≥50% |
| Supervised stacking classifier | Edge recovery recall | ≥25% | ≥25% |

Both backends (Kuzu and Rust) produce identical recall on the same holdout. The supervised scorer uses 18 topological features with a stacking classifier (LR + RF).
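A stacking classifier of the LR + RF shape described above can be sketched with scikit-learn. The hyperparameters are assumptions, and the 18 topological features are not reproduced here:

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression


def make_edge_scorer() -> StackingClassifier:
    """Stack a logistic regression and a random forest over edge
    features, with a logistic regression meta-learner combining their
    out-of-fold predictions. Illustrative configuration only."""
    return StackingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    )
```

Stacking lets the linear model capture smooth monotone features while the forest handles interactions; the meta-learner weighs the two per input.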

Why this matters

This isn't an academic exercise. In the biomedical cookbook, bridge predictions on Hetionet identified drug repurposing candidates — 4 out of 10 top predictions were independently validated in published literature, including a 2025 paper on Dasatinib. The other 6 were genuinely novel (not yet in literature).

Reproduce it

$ uv sync --extra science
$ uv run pytest tests/eval/test_hetionet_holdout.py -m slow -s

Downloads Hetionet (~15MB) on first run. Full eval takes ~9 minutes.

Curator Accuracy

The curator triages incoming claims: store, skip, or flag for review. We test this against a set of 250 expert-labeled claims.

| Metric | Result | Target |
|---|---|---|
| Overall accuracy | 98% | >80% |
| False positive rate | <1% | — |
| False negative rate | <2% | — |

This is the heuristic curator (no LLM). It runs offline, with zero API calls. LLM-backed curators can achieve higher accuracy on nuanced claims but require an API key.
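The store/skip/flag decision can be sketched as a rule cascade. The field names and thresholds below are invented for illustration and are not Substrate's actual heuristics:

```python
def triage(claim: dict) -> str:
    """Heuristic triage sketch (no LLM, no API calls).

    Rules, field names (has_source, contradicts_existing, confidence),
    and the 0.7 threshold are all illustrative assumptions.
    """
    if not claim.get("has_source"):
        return "skip"       # no provenance: the engine never stores these
    if claim.get("contradicts_existing"):
        return "flag"       # conflicts need human review
    if claim.get("confidence", 0.0) >= 0.7:
        return "store"
    return "flag"           # low-confidence claims go to review, not storage
```

Ordering the rules from cheapest and most decisive to most ambiguous is what keeps a curator like this fast enough to run on every incoming claim.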

Reproduce it

$ uv run pytest tests/eval/test_curator_accuracy.py -v

Cross-Language Verification

The Python and Rust backends must produce identical results — same entity IDs, same claim hashes, same content IDs. We verify this with 96 golden test vectors covering entity normalization, claim ID computation, and content ID computation.
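Golden vectors of this kind pin down a canonical byte sequence that both languages must hash identically. The sketch below shows the general shape; the normalization scheme (NFKC + casefold + whitespace collapse) and the field separator are assumptions, not Substrate's documented algorithm:

```python
import hashlib
import unicodedata


def normalize_entity(name: str) -> str:
    """Canonicalize an entity name so Python and Rust agree byte-for-byte.
    NFKC handles compatibility forms, casefold handles Greek and other
    non-ASCII case pairs, and split/join collapses all whitespace."""
    return " ".join(unicodedata.normalize("NFKC", name).casefold().split())


def claim_id(subject: str, predicate: str, obj: str) -> str:
    """SHA-256 over a canonical encoding of the claim triple.
    The unit-separator delimiter (0x1F) is an illustrative choice."""
    canonical = "\x1f".join((normalize_entity(subject), predicate, normalize_entity(obj)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A golden-vector suite then records `(input, expected_hex)` pairs generated from one implementation and asserts the other reproduces them exactly.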

| What's tested | Vectors | Status |
|---|---|---|
| Entity normalization (Unicode, Greek, whitespace) | 32 | Bit-identical |
| Claim ID computation (SHA-256) | 32 | Bit-identical |
| Content ID computation (SHA-256) | 32 | Bit-identical |

Reproduce it

# Generate vectors from Python
$ uv run python scripts/generate_golden_vectors.py

# Verify in Rust
$ cd rust && cargo test

What Only Claim-Native Can Benchmark

Traditional graph database benchmarks (LDBC SNB, etc.) measure query throughput and traversal latency. Those benchmarks don't test the things that make Substrate different, because no other database does them.

Retraction Cascade

Retract a source and every downstream claim is automatically marked as degraded. Corroborated facts survive. Tested end-to-end in test_provenance_cascade.py.
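The cascade semantics can be sketched in a few lines: a claim degrades only when its last live source is retracted. The data shapes here are illustrative, not the engine's internal model:

```python
def retract_source(source_id: str, claims: list[dict]) -> list[dict]:
    """Mark claims degraded when their only remaining source is retracted;
    claims corroborated by another live source survive intact.
    Field names (sources, status) are illustrative assumptions."""
    for claim in claims:
        claim["sources"].discard(source_id)
        if not claim["sources"]:
            claim["status"] = "degraded"
    return claims
```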

Corroboration Tracking

Same fact from two independent sources? The engine tracks it as corroboration, not a duplicate. Verified in test_symmetric.py.
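Dedup-as-corroboration reduces to keying claims by content hash and accumulating sources. A minimal sketch, with invented names:

```python
def record_claim(store: dict, claim_hash: str, source_id: str) -> int:
    """If the hash is new, create one entry; if it already exists,
    add the source as corroboration rather than storing a duplicate.
    Returns the claim's current corroboration count."""
    entry = store.setdefault(claim_hash, {"sources": set()})
    entry["sources"].add(source_id)
    return len(entry["sources"])
```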

Time Travel Queries

Query the knowledge base as it existed at any past timestamp. Append-only claim log makes this free. Tested in test_time_travel.py.
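Why the append-only log makes this free: the state at any timestamp is just a prefix replay. A sketch, assuming a simplified `(timestamp, op, claim_id)` log format that is not the engine's actual schema:

```python
def as_of(log: list[tuple[int, str, str]], ts: int) -> set[str]:
    """Replay an append-only, timestamp-ordered claim log up to ts
    and return the set of claims live at that instant."""
    state: set[str] = set()
    for t, op, claim_id in log:
        if t > ts:
            break               # log is ordered; nothing later matters
        if op == "add":
            state.add(claim_id)
        else:                   # retraction is just another log entry
            state.discard(claim_id)
    return state
```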

Provenance Audit

Every fact traces back to its source. Every extraction traces to the conversation turn. No claim exists without provenance — the engine rejects it.

Test Suite

| Suite | Tests | Runtime |
|---|---|---|
| Python unit + integration | ~290 | ~60 s |
| Rust unit + golden vectors | 46 | ~3 s |
| Eval (Hetionet, curator accuracy) | 6 | ~9 min |

# Run everything except slow eval tests
$ uv run pytest tests/unit/ tests/integration/ -q

# Run Rust tests
$ cd rust && cargo test

# Run full eval suite (slow, downloads data)
$ uv run pytest tests/eval/ -m slow -s