All numbers on this page come from automated tests that run in CI. You can reproduce every benchmark yourself — the test files, datasets, and Rust harness are all in the repo.
Substrate ships two backends with the same API. The Kuzu backend is the default and works everywhere. The Rust backend is built for high-throughput workloads.
| Operation | Kuzu Backend | Rust Backend | Speedup |
|---|---|---|---|
| Claim ingestion | ~36 claims/sec | 1.3M claims/sec | 36,000x |
| Entity query | ~12 ms | 8 µs | 1,500x |
| BFS traversal (depth 2) | ~12 ms | 15 µs | 800x |
| Adjacency list build (1K claims) | — | 223 µs | — |
| ContextFrame assembly (depth 2) | 12 ms | 12 ms | 1x (Python overhead) |
Rust microbenchmarks live in `rust/substrate-store/benches/store_bench.rs`: 1,000 pre-built claims, 100 entities, in-memory store; `black_box()` prevents the compiler from optimizing away the measured work. Python performance tests live in `tests/integration/test_performance.py`: 100 synthetic claims, batch ingestion, depth-2 traversal of the highest-degree entity.

```shell
# Rust microbenchmarks
$ cd rust && cargo bench

# Python performance tests
$ uv run pytest tests/integration/test_performance.py -v
```
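The Python side of the harness boils down to timing a batch ingest with `time.perf_counter`. A minimal self-contained sketch, using a stand-in store class rather than the real Substrate API:

```python
import time

# Hypothetical stand-in for the Substrate store API; the real class and
# method names live in the repo, this is only illustrative.
class InMemoryStore:
    def __init__(self):
        self.claims = []

    def ingest_batch(self, claims):
        self.claims.extend(claims)

# 1,000 synthetic claims over 100 entities, mirroring the bench setup.
claims = [{"subject": f"e{i % 100}", "predicate": "relates_to",
           "object": f"e{(i + 1) % 100}"} for i in range(1000)]

store = InMemoryStore()
start = time.perf_counter()
store.ingest_batch(claims)
elapsed = time.perf_counter() - start
print(f"{len(claims) / elapsed:,.0f} claims/sec")
```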
Hetionet is a public biomedical knowledge graph with 47,031 nodes and 2.25 million edges across 24 relationship types. We use it to test whether Substrate can predict missing knowledge.
| Method | Metric | Result | Target |
|---|---|---|---|
| Unsupervised (SVD + random walk) | Edge recovery recall | 17.35% | >15% |
| 4-signal ensemble | Precision@50 | ≥15% or ≥3 hits | ≥15% or ≥3 hits |
| Supervised stacking classifier | Precision@50 | ≥50% | ≥50% |
| Supervised stacking classifier | Edge recovery recall | ≥25% | ≥25% |
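The two metrics in the table are straightforward to compute once predictions are ranked. A minimal sketch with toy edges (k is lowered from 50 for the example):

```python
def precision_at_k(predicted, true_edges, k=50):
    """Fraction of the top-k predictions that are held-out true edges."""
    hits = sum(1 for edge in predicted[:k] if edge in true_edges)
    return hits / k

def edge_recovery_recall(predicted, true_edges, k=50):
    """Fraction of held-out edges recovered within the top-k predictions."""
    hits = sum(1 for edge in predicted[:k] if edge in true_edges)
    return hits / len(true_edges)

true_edges = {("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")}
predicted = [("a", "b"), ("x", "y"), ("c", "d"), ("p", "q")]
print(precision_at_k(predicted, true_edges, k=4))        # 0.5
print(edge_recovery_recall(predicted, true_edges, k=4))  # 0.5
```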
Both backends (Kuzu and Rust) produce identical recall on the same holdout. The supervised scorer feeds 18 topological features into a stacking classifier (logistic regression + random forest base learners).
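That kind of stacker can be sketched with scikit-learn. The synthetic features below are placeholders; the real 18-feature set lives in the repo's eval code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 200 candidate edges, 18 topological features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 18))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy "edge exists" labels

# LR + RF base learners, combined by a logistic-regression meta-learner.
clf = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # rank candidate edges by this score
print(scores.shape)  # (200,)
```

Candidates are then sorted by `scores` and the top 50 are checked against the holdout.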
This isn't an academic exercise. In the biomedical cookbook, bridge predictions on Hetionet identified drug repurposing candidates — 4 out of 10 top predictions were independently validated in published literature, including a 2025 paper on Dasatinib. The other 6 were genuinely novel (not yet in literature).
```shell
$ uv sync --extra science
$ uv run pytest tests/eval/test_hetionet_holdout.py -m slow -s
```
Downloads Hetionet (~15MB) on first run. Full eval takes ~9 minutes.
The curator triages incoming claims: store, skip, or flag for review. We test this against a set of 250 expert-labeled claims.
| Metric | Result | Target |
|---|---|---|
| Overall accuracy | 98% | >80% |
| False positive rate | <1% | — |
| False negative rate | <2% | — |
This is the heuristic curator (no LLM). It runs offline, with zero API calls. LLM-backed curators can achieve higher accuracy on nuanced claims but require an API key.
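A heuristic triage of this shape can be sketched in a few lines. The thresholds, field names, and trigger words below are illustrative, not the repo's actual rules:

```python
def triage(claim: dict) -> str:
    """Return one of 'store', 'skip', or 'flag' for an incoming claim."""
    text = claim.get("text", "").strip()
    if not text or claim.get("source") is None:
        return "skip"                      # unusable or unsourced
    if claim.get("confidence", 0.0) < 0.3:
        return "flag"                      # low confidence: human review
    if any(word in text.lower() for word in ("always", "never", "cure")):
        return "flag"                      # absolute claims need review
    return "store"

print(triage({"text": "Dasatinib inhibits SRC kinase",
              "source": "pmid:123", "confidence": 0.9}))               # store
print(triage({"text": "X cures cancer", "source": "blog",
              "confidence": 0.9}))                                     # flag
print(triage({"text": "", "source": None}))                            # skip
```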
```shell
$ uv run pytest tests/eval/test_curator_accuracy.py -v
```
The Python and Rust backends must produce identical results — same entity IDs, same claim hashes, same content IDs. We verify this with 96 golden test vectors covering entity normalization, claim ID computation, and content ID computation.
| What's tested | Vectors | Status |
|---|---|---|
| Entity normalization (Unicode, Greek, whitespace) | 32 | Bit-identical |
| Claim ID computation (SHA-256) | 32 | Bit-identical |
| Content ID computation (SHA-256) | 32 | Bit-identical |
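The flavor of what the vectors pin down can be sketched with the standard library. The normalization and payload format here are illustrative; the exact rules both backends must match live in the repo:

```python
import hashlib
import unicodedata

def normalize_entity(name: str) -> str:
    # NFKC folds Unicode compatibility variants; casefold handles Greek
    # and other scripts; split/join collapses whitespace.
    return " ".join(unicodedata.normalize("NFKC", name).casefold().split())

def claim_id(subject: str, predicate: str, obj: str) -> str:
    payload = "|".join(normalize_entity(p) for p in (subject, predicate, obj))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

print(normalize_entity("  ΑΒΓ  test "))  # 'αβγ test'
print(claim_id("Dasatinib", "inhibits", "SRC"))
```

Because both sides hash the same normalized bytes, the Python and Rust digests must be bit-identical, which is exactly what the golden vectors assert.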
```shell
# Generate vectors from Python
$ uv run python scripts/generate_golden_vectors.py

# Verify in Rust
$ cd rust && cargo test
```
Traditional graph database benchmarks (LDBC SNB, etc.) measure query throughput and traversal latency. Those benchmarks don't test the things that make Substrate different, because no other database does them.
Retract a source and every downstream claim is automatically marked as degraded. Corroborated facts survive. Tested end-to-end in `test_provenance_cascade.py`.
Same fact from two independent sources? The engine tracks it as corroboration, not a duplicate. Verified in `test_symmetric.py`.
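Both behaviors fall out of claims recording which sources support them. A minimal sketch, assuming a dict-of-sets schema rather than the real data model:

```python
# Claims record their supporting sources; retracting a source degrades
# any claim that loses all of its support.
claims = {
    "c1": {"sources": {"s1"}},        # single-sourced: degrades
    "c2": {"sources": {"s1", "s2"}},  # corroborated: survives
    "c3": {"sources": {"s2"}},
}

def retract(source_id, claims):
    degraded = set()
    for cid, claim in claims.items():
        claim["sources"].discard(source_id)
        if not claim["sources"]:
            degraded.add(cid)
    return degraded

print(retract("s1", claims))  # {'c1'}
```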
Query the knowledge base as it existed at any past timestamp. Append-only claim log makes this free. Tested in `test_time_travel.py`.
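With an append-only log, "as of" queries are just a timestamp filter; no snapshots or undo machinery needed. A sketch with an illustrative log schema:

```python
from datetime import datetime, timezone

# Append-only claim log: entries are never mutated, only appended.
log = [
    {"id": "c1", "ts": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": "c2", "ts": datetime(2024, 6, 1, tzinfo=timezone.utc)},
    {"id": "c3", "ts": datetime(2025, 1, 1, tzinfo=timezone.utc)},
]

def as_of(log, when):
    """State of the knowledge base at `when`: everything appended by then."""
    return [entry for entry in log if entry["ts"] <= when]

past = as_of(log, datetime(2024, 7, 1, tzinfo=timezone.utc))
print([e["id"] for e in past])  # ['c1', 'c2']
```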
Every fact traces back to its source. Every extraction traces to the conversation turn. No claim exists without provenance — the engine rejects it.
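Enforcement amounts to a guard on the ingest path. A hypothetical sketch (the error type and field names are illustrative):

```python
class ProvenanceError(ValueError):
    pass

def ingest(claim: dict, store: list) -> None:
    """Append a claim, refusing anything without a source."""
    if not claim.get("source"):
        raise ProvenanceError("claim has no provenance; rejected")
    store.append(claim)

store = []
ingest({"text": "fact", "source": "doc:1"}, store)
try:
    ingest({"text": "orphan"}, store)
except ProvenanceError as e:
    print(e)
print(len(store))  # 1
```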
| Suite | Tests | Runtime |
|---|---|---|
| Python unit + integration | ~290 | ~60s |
| Rust unit + golden vectors | 46 | ~3s |
| Eval (Hetionet, curator accuracy) | 6 | ~9 min |
```shell
# Run everything except slow eval tests
$ uv run pytest tests/unit/ tests/integration/ -q

# Run Rust tests
$ cd rust && cargo test

# Run full eval suite (slow, downloads data)
$ uv run pytest tests/eval/ -m slow -s
```