Merged
17 changes: 15 additions & 2 deletions benchmarks/README.md
@@ -1,6 +1,6 @@
# Recall Benchmarks (Phase 2)

> Status: Suite B (token efficiency) implemented. Suites A / C / D / E are scaffolded in the runner but not yet built. See `.atlas/plans/2026-04-17-mempalace-research-borrow-list.md` for the full Phase 2 design.
> Status: Suite B (token efficiency) and Suite C (precision under noise) implemented. Suites A / D / E are scaffolded in the runner but not yet built. See `.atlas/plans/2026-04-17-mempalace-research-borrow-list.md` for the full Phase 2 design.

## Why this exists

@@ -51,10 +51,23 @@ Each run writes two files to `benchmarks/results/`:
|---|---|---|---|
| A | Cross-session recall | Planned | Retrieval@5 + answer accuracy across N-session synthetic gaps |
| B | Token efficiency | **Built** | Wake-up bundle char/token cost vs v1 baseline and CLAUDE.md |
| C | Precision under noise | Planned | Precision@5 and latency at corpus sizes 100 / 1k / 10k / 100k |
| C | Precision under noise | **Built** | P@5 / R@5 / MRR@5 + latency p50/p95 at corpus sizes 100 / 1k / 10k / 100k |
| D | Structured-knowledge fidelity | Planned | Supersession correctness, LoA elevation in mixed results |
| E | Real-world replay | Planned | Help-rate and wrong-direction-rate on anonymized session history |

## Suite C methodology — precision under noise

Suite C answers one question: **when the database is full of junk, does `search()` still surface the right record?**

- **Corpus.** For each size in the ladder (default 100 / 1,000 / 10,000 / 100,000 records), a synthetic corpus is built in a temporary DB using the real schema (`initDb()`) and the real write paths from `src/lib/memory.ts`, so FTS triggers populate exactly as in production. The user's real DB is never touched.
- **Determinism.** Fixture generation is seeded (mulberry32, default seed 47). The same seed and size produce byte-identical record content, so runs are comparable across machines and over time. Tests assert this.
- **Ground truth.** A fixed set of target records (constant across sizes) carries labels: table, project, and provenance. The rest of the corpus is noise in three roles: near-duplicates of targets (the precision trap), entity-name collisions, and low-signal filler. Noise spans all five searchable tables, including messages.
- **Queries.** Four labeled categories: exact project/name lookup, paraphrased decision lookup, learning/problem lookup, and noisy ambiguous queries. Ambiguous queries carry explicit collision labels (name / project / topic) so failures can be attributed to entity ambiguity vs generic ranking noise.
- **Metrics.** Precision@5, Recall@5, and MRR@5 per corpus size, plus breakdowns by query category, by ground-truth table (`r_at_5_table_*`), and by provenance (`r_at_5_prov_*`). No composite scores, per the methodology rules.
- **Latency.** One unmeasured warmup pass per corpus size, then 5 measured repeats per query on a warm connection; p50/p95 are computed across all measured calls at that size. The report caveats state the protocol and whether the embedding service was available. Suite C exercises the FTS5 keyword path only, so semantic/hybrid retrieval is not measured either way.
- **Baseline-first.** The first run records an honest baseline; there is no pass/fail threshold. Later regression gating can diff runs against the checked-in baseline JSONL in `benchmarks/results/`.
- **Overrides.** `RECALL_BENCH_C_SIZES` (comma-separated) and `RECALL_BENCH_C_REPEATS` override the corpus ladder and repeat count — used by tests to keep CI fast; leave unset for comparable real runs.
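
The determinism bullet above relies on a seeded generator. As a rough illustration, here is the widely circulated mulberry32 routine in TypeScript, plus a hypothetical `buildNoiseTitles` helper showing how one seed can drive repeatable fixture content. The helper, the word lists, and the title format are assumptions for illustration only; the suite's actual fixture code is not shown in this diff.

```typescript
// mulberry32: a small 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pick one element using the supplied generator (never Math.random),
// so the choice depends only on the seed and the call order.
function pick<T>(rng: () => number, items: readonly T[]): T {
  return items[Math.floor(rng() * items.length)];
}

// Hypothetical word lists; the real fixture vocabulary is not in this diff.
const VERBS = ['migrate', 'refactor', 'revert', 'tune'] as const;
const TOPICS = ['auth', 'cache', 'schema', 'deploy'] as const;

// Hypothetical helper: same (seed, count) always yields the same titles.
function buildNoiseTitles(seed: number, count: number): string[] {
  const rng = mulberry32(seed);
  return Array.from(
    { length: count },
    (_, i) => `${pick(rng, VERBS)} ${pick(rng, TOPICS)} #${i}`,
  );
}
```

Because every random choice flows through one generator created from the seed, two calls with the same seed and count return identical arrays, which is the property a byte-identity test would check.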

## Adding a new suite

1. Create `benchmarks/suites/suite-<id>-<name>.ts` exporting `runSuite<id>(): Promise<SuiteResult>`.
1 change: 1 addition & 0 deletions benchmarks/results/2026-06-11T09-36-53-suite-C.jsonl
@@ -0,0 +1 @@
{"startedAt":"2026-06-11T09:36:48.827Z","finishedAt":"2026-06-11T09:36:53.977Z","recallVersion":"1.0.0","hostInfo":{"platform":"darwin-arm64","bunVersion":"1.3.14"},"suites":[{"suite":"C","name":"Precision under noise","description":"Measures FTS5 search() precision against seeded synthetic corpora (sizes: 100, 1000, 10000, 100000; seed: 47). A ground-truth-labeled query set (exact lookup, paraphrase, problem lookup, ambiguous-with-collisions) runs at each size; reports P@5, R@5, MRR@5, latency p50/p95, and breakdowns by query category, target table, and provenance.","ranAt":"2026-06-11T09:36:53.977Z","durationMs":5149,"samples":[{"name":"p_at_5","value":0.1857,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5","value":0.8095,"unit":"ratio","scope":"corpus=100"},{"name":"mrr","value":0.7381,"unit":"ratio (MRR@5)","scope":"corpus=100"},{"name":"latency_p50_ms","value":0.338,"unit":"ms","scope":"corpus=100"},{"name":"latency_p95_ms","value":0.475,"unit":"ms","scope":"corpus=100"},{"name":"p_at_5_cat_exact_lookup","value":0.2,"unit":"ratio","scope":"corpus=100"},{"name":"mrr_cat_exact_lookup","value":1,"unit":"ratio","scope":"corpus=100"},{"name":"p_at_5_cat_paraphrase","value":0.0667,"unit":"ratio","scope":"corpus=100"},{"name":"mrr_cat_paraphrase","value":0.3333,"unit":"ratio","scope":"corpus=100"},{"name":"p_at_5_cat_problem_lookup","value":0.2,"unit":"ratio","scope":"corpus=100"},{"name":"mrr_cat_problem_lookup","value":0.8333,"unit":"ratio","scope":"corpus=100"},{"name":"p_at_5_cat_ambiguous","value":0.25,"unit":"ratio","scope":"corpus=100"},{"name":"mrr_cat_ambiguous","value":0.7083,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_table_decisions","value":0.625,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_table_loa_entries","value":0.5,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_table_breadcrumbs","value":1,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_table_learnings","value":1,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_prov_user_authored","value":0.75,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_prov_extracted","value":0.6667,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_prov_verbatim","value":1,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_prov_derived","value":1,"unit":"ratio","scope":"corpus=100"},{"name":"p_at_5","value":0.0714,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5","value":0.3571,"unit":"ratio","scope":"corpus=1000"},{"name":"mrr","value":0.2857,"unit":"ratio (MRR@5)","scope":"corpus=1000"},{"name":"latency_p50_ms","value":0.442,"unit":"ms","scope":"corpus=1000"},{"name":"latency_p95_ms","value":0.714,"unit":"ms","scope":"corpus=1000"},{"name":"p_at_5_cat_exact_lookup","value":0.15,"unit":"ratio","scope":"corpus=1000"},{"name":"mrr_cat_exact_lookup","value":0.75,"unit":"ratio","scope":"corpus=1000"},{"name":"p_at_5_cat_paraphrase","value":0.0667,"unit":"ratio","scope":"corpus=1000"},{"name":"mrr_cat_paraphrase","value":0.1667,"unit":"ratio","scope":"corpus=1000"},{"name":"p_at_5_cat_problem_lookup","value":0,"unit":"ratio","scope":"corpus=1000"},{"name":"mrr_cat_problem_lookup","value":0,"unit":"ratio","scope":"corpus=1000"},{"name":"p_at_5_cat_ambiguous","value":0.05,"unit":"ratio","scope":"corpus=1000"},{"name":"mrr_cat_ambiguous","value":0.125,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_table_decisions","value":0.5,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_table_loa_entries","value":0,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_table_breadcrumbs","value":1,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_table_learnings","value":0,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_prov_user_authored","value":0.5,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_prov_extracted","value":0.2222,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_prov_verbatim","value":0.5,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_prov_derived","value":0,"unit":"ratio","scope":"corpus=1000"},{"name":"p_at_5","value":0.0429,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5","value":0.2143,"unit":"ratio","scope":"corpus=10000"},{"name":"mrr","value":0.1095,"unit":"ratio (MRR@5)","scope":"corpus=10000"},{"name":"latency_p50_ms","value":1.246,"unit":"ms","scope":"corpus=10000"},{"name":"latency_p95_ms","value":5.236,"unit":"ms","scope":"corpus=10000"},{"name":"p_at_5_cat_exact_lookup","value":0.1,"unit":"ratio","scope":"corpus=10000"},{"name":"mrr_cat_exact_lookup","value":0.1333,"unit":"ratio","scope":"corpus=10000"},{"name":"p_at_5_cat_paraphrase","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"mrr_cat_paraphrase","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"p_at_5_cat_problem_lookup","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"mrr_cat_problem_lookup","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"p_at_5_cat_ambiguous","value":0.05,"unit":"ratio","scope":"corpus=10000"},{"name":"mrr_cat_ambiguous","value":0.25,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_table_decisions","value":0.25,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_table_loa_entries","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_table_breadcrumbs","value":1,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_table_learnings","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_prov_user_authored","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_prov_extracted","value":0.2222,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_prov_verbatim","value":0.5,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_prov_derived","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"p_at_5","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"mrr","value":0,"unit":"ratio (MRR@5)","scope":"corpus=100000"},{"name":"latency_p50_ms","value":2.626,"unit":"ms","scope":"corpus=100000"},{"name":"latency_p95_ms","value":36.182,"unit":"ms","scope":"corpus=100000"},{"name":"p_at_5_cat_exact_lookup","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"mrr_cat_exact_lookup","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"p_at_5_cat_paraphrase","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"mrr_cat_paraphrase","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"p_at_5_cat_problem_lookup","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"mrr_cat_problem_lookup","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"p_at_5_cat_ambiguous","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"mrr_cat_ambiguous","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_table_decisions","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_table_loa_entries","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_table_breadcrumbs","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_table_learnings","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_prov_user_authored","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_prov_extracted","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_prov_verbatim","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_prov_derived","value":0,"unit":"ratio","scope":"corpus=100000"}],"caveats":["Synthetic corpus: deterministic seeded fixtures (seed 47). Absolute scores do not transfer to real-world corpora; compare runs only against this same fixture set.","Latency protocol: 1 unmeasured warmup pass per corpus size, then 5 measured repeats per query on a warm connection; p50/p95 are computed across all measured calls at that size. Relevance metrics come from the first measured pass (retrieval is deterministic for a fixed corpus).","Embedding service available: no. Suite C exercises the FTS5 keyword path (search()) only — semantic/hybrid retrieval is NOT measured in this baseline either way.","FTS5 MATCH is implicit AND with no stemming — paraphrase-category queries are expected to score near zero on keyword search. That gap is part of the honest baseline this suite records.","Dedup was NOT run before measurement: the corpus contains unmarked near-duplicates that legitimately compete in ranking. search() excludes only records already marked in dedup_lineage.","Ground truth never includes messages-table records — messages are noise-only in this corpus. The project column is part of every FTS index, so unscoped queries can match records via their project name alone.","No pass/fail threshold — baseline-first. Later regression gating can diff future runs against the checked-in baseline JSONL."]}]}
112 changes: 112 additions & 0 deletions benchmarks/results/2026-06-11T09-36-53-suite-C.md
@@ -0,0 +1,112 @@
# Recall Benchmark Run

- **Started:** 2026-06-11T09:36:48.827Z
- **Finished:** 2026-06-11T09:36:53.977Z
- **Recall version:** 1.0.0
- **Host:** darwin-arm64 (Bun 1.3.14)

## Suite C — Precision under noise

Measures FTS5 search() precision against seeded synthetic corpora (sizes: 100, 1000, 10000, 100000; seed: 47). A ground-truth-labeled query set (exact lookup, paraphrase, problem lookup, ambiguous-with-collisions) runs at each size; reports P@5, R@5, MRR@5, latency p50/p95, and breakdowns by query category, target table, and provenance.
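
For readers unfamiliar with the three relevance metrics named above, the following is a minimal sketch of how P@5, R@5, and MRR@5 are commonly computed for one query. The `QueryOutcome` shape and the function names are hypothetical, and dividing precision by k (rather than by the number of results returned) is an assumption; the suite's own scoring code is not part of this diff.

```typescript
// Hypothetical shape: one query's ranked result ids plus its ground truth.
interface QueryOutcome {
  ranked: string[];      // result ids in rank order, best first (assumed unique)
  relevant: Set<string>; // ground-truth ids for this query
}

const K = 5;

// Precision@k: fraction of the top-k slots filled by relevant records.
// Assumption: the divisor is k even when fewer than k results come back.
function precisionAtK(o: QueryOutcome, k: number = K): number {
  const hits = o.ranked.slice(0, k).filter((id) => o.relevant.has(id)).length;
  return hits / k;
}

// Recall@k: fraction of the relevant records that appear in the top k.
function recallAtK(o: QueryOutcome, k: number = K): number {
  if (o.relevant.size === 0) return 0;
  const hits = o.ranked.slice(0, k).filter((id) => o.relevant.has(id)).length;
  return hits / o.relevant.size;
}

// Reciprocal rank@k: 1 / rank of the first relevant hit, or 0 if none in top k.
function reciprocalRankAtK(o: QueryOutcome, k: number = K): number {
  const idx = o.ranked.slice(0, k).findIndex((id) => o.relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// A suite-level figure is the mean of a per-query metric (MRR = mean RR).
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}
```

Under this convention a query with exactly one relevant record can score at most 1/5 = 0.2 on P@5 even when that record is ranked first, so a low P@5 next to a high MRR is not contradictory.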

_Ran in 5149 ms at 2026-06-11T09:36:53.977Z._

| Metric | Value | Unit | Scope | vs Baseline |
|---|---:|---|---|---|
| p_at_5 | 0.1857 | ratio | corpus=100 | — |
| r_at_5 | 0.8095 | ratio | corpus=100 | — |
| mrr | 0.7381 | ratio (MRR@5) | corpus=100 | — |
| latency_p50_ms | 0.338 | ms | corpus=100 | — |
| latency_p95_ms | 0.475 | ms | corpus=100 | — |
| p_at_5_cat_exact_lookup | 0.2 | ratio | corpus=100 | — |
| mrr_cat_exact_lookup | 1 | ratio | corpus=100 | — |
| p_at_5_cat_paraphrase | 0.0667 | ratio | corpus=100 | — |
| mrr_cat_paraphrase | 0.3333 | ratio | corpus=100 | — |
| p_at_5_cat_problem_lookup | 0.2 | ratio | corpus=100 | — |
| mrr_cat_problem_lookup | 0.8333 | ratio | corpus=100 | — |
| p_at_5_cat_ambiguous | 0.25 | ratio | corpus=100 | — |
| mrr_cat_ambiguous | 0.7083 | ratio | corpus=100 | — |
| r_at_5_table_decisions | 0.625 | ratio | corpus=100 | — |
| r_at_5_table_loa_entries | 0.5 | ratio | corpus=100 | — |
| r_at_5_table_breadcrumbs | 1 | ratio | corpus=100 | — |
| r_at_5_table_learnings | 1 | ratio | corpus=100 | — |
| r_at_5_prov_user_authored | 0.75 | ratio | corpus=100 | — |
| r_at_5_prov_extracted | 0.6667 | ratio | corpus=100 | — |
| r_at_5_prov_verbatim | 1 | ratio | corpus=100 | — |
| r_at_5_prov_derived | 1 | ratio | corpus=100 | — |
| p_at_5 | 0.0714 | ratio | corpus=1000 | — |
| r_at_5 | 0.3571 | ratio | corpus=1000 | — |
| mrr | 0.2857 | ratio (MRR@5) | corpus=1000 | — |
| latency_p50_ms | 0.442 | ms | corpus=1000 | — |
| latency_p95_ms | 0.714 | ms | corpus=1000 | — |
| p_at_5_cat_exact_lookup | 0.15 | ratio | corpus=1000 | — |
| mrr_cat_exact_lookup | 0.75 | ratio | corpus=1000 | — |
| p_at_5_cat_paraphrase | 0.0667 | ratio | corpus=1000 | — |
| mrr_cat_paraphrase | 0.1667 | ratio | corpus=1000 | — |
| p_at_5_cat_problem_lookup | 0 | ratio | corpus=1000 | — |
| mrr_cat_problem_lookup | 0 | ratio | corpus=1000 | — |
| p_at_5_cat_ambiguous | 0.05 | ratio | corpus=1000 | — |
| mrr_cat_ambiguous | 0.125 | ratio | corpus=1000 | — |
| r_at_5_table_decisions | 0.5 | ratio | corpus=1000 | — |
| r_at_5_table_loa_entries | 0 | ratio | corpus=1000 | — |
| r_at_5_table_breadcrumbs | 1 | ratio | corpus=1000 | — |
| r_at_5_table_learnings | 0 | ratio | corpus=1000 | — |
| r_at_5_prov_user_authored | 0.5 | ratio | corpus=1000 | — |
| r_at_5_prov_extracted | 0.2222 | ratio | corpus=1000 | — |
| r_at_5_prov_verbatim | 0.5 | ratio | corpus=1000 | — |
| r_at_5_prov_derived | 0 | ratio | corpus=1000 | — |
| p_at_5 | 0.0429 | ratio | corpus=10000 | — |
| r_at_5 | 0.2143 | ratio | corpus=10000 | — |
| mrr | 0.1095 | ratio (MRR@5) | corpus=10000 | — |
| latency_p50_ms | 1.246 | ms | corpus=10000 | — |
| latency_p95_ms | 5.236 | ms | corpus=10000 | — |
| p_at_5_cat_exact_lookup | 0.1 | ratio | corpus=10000 | — |
| mrr_cat_exact_lookup | 0.1333 | ratio | corpus=10000 | — |
| p_at_5_cat_paraphrase | 0 | ratio | corpus=10000 | — |
| mrr_cat_paraphrase | 0 | ratio | corpus=10000 | — |
| p_at_5_cat_problem_lookup | 0 | ratio | corpus=10000 | — |
| mrr_cat_problem_lookup | 0 | ratio | corpus=10000 | — |
| p_at_5_cat_ambiguous | 0.05 | ratio | corpus=10000 | — |
| mrr_cat_ambiguous | 0.25 | ratio | corpus=10000 | — |
| r_at_5_table_decisions | 0.25 | ratio | corpus=10000 | — |
| r_at_5_table_loa_entries | 0 | ratio | corpus=10000 | — |
| r_at_5_table_breadcrumbs | 1 | ratio | corpus=10000 | — |
| r_at_5_table_learnings | 0 | ratio | corpus=10000 | — |
| r_at_5_prov_user_authored | 0 | ratio | corpus=10000 | — |
| r_at_5_prov_extracted | 0.2222 | ratio | corpus=10000 | — |
| r_at_5_prov_verbatim | 0.5 | ratio | corpus=10000 | — |
| r_at_5_prov_derived | 0 | ratio | corpus=10000 | — |
| p_at_5 | 0 | ratio | corpus=100000 | — |
| r_at_5 | 0 | ratio | corpus=100000 | — |
| mrr | 0 | ratio (MRR@5) | corpus=100000 | — |
| latency_p50_ms | 2.626 | ms | corpus=100000 | — |
| latency_p95_ms | 36.182 | ms | corpus=100000 | — |
| p_at_5_cat_exact_lookup | 0 | ratio | corpus=100000 | — |
| mrr_cat_exact_lookup | 0 | ratio | corpus=100000 | — |
| p_at_5_cat_paraphrase | 0 | ratio | corpus=100000 | — |
| mrr_cat_paraphrase | 0 | ratio | corpus=100000 | — |
| p_at_5_cat_problem_lookup | 0 | ratio | corpus=100000 | — |
| mrr_cat_problem_lookup | 0 | ratio | corpus=100000 | — |
| p_at_5_cat_ambiguous | 0 | ratio | corpus=100000 | — |
| mrr_cat_ambiguous | 0 | ratio | corpus=100000 | — |
| r_at_5_table_decisions | 0 | ratio | corpus=100000 | — |
| r_at_5_table_loa_entries | 0 | ratio | corpus=100000 | — |
| r_at_5_table_breadcrumbs | 0 | ratio | corpus=100000 | — |
| r_at_5_table_learnings | 0 | ratio | corpus=100000 | — |
| r_at_5_prov_user_authored | 0 | ratio | corpus=100000 | — |
| r_at_5_prov_extracted | 0 | ratio | corpus=100000 | — |
| r_at_5_prov_verbatim | 0 | ratio | corpus=100000 | — |
| r_at_5_prov_derived | 0 | ratio | corpus=100000 | — |
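
The `latency_p50_ms` and `latency_p95_ms` rows above come from a warmup-then-measure protocol. A minimal sketch of that idea follows, assuming a nearest-rank percentile and an injected clock; the report does not say which percentile method the suite uses, and `measureLatencies` and `percentile` are hypothetical names.

```typescript
// Run `run` unmeasured `warmup` times, then time `repeats` measured calls.
// The clock is injected so the sketch does not depend on a runtime global;
// in practice one might pass () => performance.now().
function measureLatencies(
  run: () => void,
  now: () => number,
  warmup: number,
  repeats: number,
): number[] {
  for (let i = 0; i < warmup; i++) run();
  const samples: number[] = [];
  for (let i = 0; i < repeats; i++) {
    const start = now();
    run();
    samples.push(now() - start);
  }
  return samples;
}

// Nearest-rank percentile (assumption: the suite may interpolate instead).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

Pooling the samples from every query at one corpus size before calling `percentile` would match the "across all measured calls at that size" wording in the caveats.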

### Caveats

- Synthetic corpus: deterministic seeded fixtures (seed 47). Absolute scores do not transfer to real-world corpora; compare runs only against this same fixture set.
- Latency protocol: 1 unmeasured warmup pass per corpus size, then 5 measured repeats per query on a warm connection; p50/p95 are computed across all measured calls at that size. Relevance metrics come from the first measured pass (retrieval is deterministic for a fixed corpus).
- Embedding service available: no. Suite C exercises the FTS5 keyword path (search()) only — semantic/hybrid retrieval is NOT measured in this baseline either way.
- FTS5 MATCH is implicit AND with no stemming — paraphrase-category queries are expected to score near zero on keyword search. That gap is part of the honest baseline this suite records.
- Dedup was NOT run before measurement: the corpus contains unmarked near-duplicates that legitimately compete in ranking. search() excludes only records already marked in dedup_lineage.
- Ground truth never includes messages-table records — messages are noise-only in this corpus. The project column is part of every FTS index, so unscoped queries can match records via their project name alone.
- No pass/fail threshold — baseline-first. Later regression gating can diff future runs against the checked-in baseline JSONL.
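
The last caveat mentions diffing future runs against the checked-in baseline. One way that could look is sketched below, keyed on each sample's `scope` and `name`. The `Sample` fields mirror the JSONL in this PR; `diffSamples` and `SampleDelta` are hypothetical, since no gating code exists in this diff.

```typescript
// Mirrors the sample objects in the results JSONL.
interface Sample {
  name: string;
  value: number;
  unit: string;
  scope: string;
}

// Hypothetical output row for a run-over-run comparison.
interface SampleDelta {
  name: string;
  scope: string;
  before: number;
  after: number;
  delta: number; // after - before
}

// Pair samples by (scope, name) and report the change for each match.
// Samples present in only one of the two runs are skipped.
function diffSamples(baseline: Sample[], current: Sample[]): SampleDelta[] {
  const key = (s: Sample): string => `${s.scope}|${s.name}`;
  const baseValues = new Map<string, number>(
    baseline.map((s): [string, number] => [key(s), s.value]),
  );
  const deltas: SampleDelta[] = [];
  for (const s of current) {
    const before = baseValues.get(key(s));
    if (before === undefined) continue;
    deltas.push({
      name: s.name,
      scope: s.scope,
      before,
      after: s.value,
      delta: s.value - before,
    });
  }
  return deltas;
}
```

Keying on scope as well as name matters here because the same metric name (for example `p_at_5`) repeats once per corpus size.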

---
_All metrics are unblended. We do not publish composite scores. See the per-suite caveats before drawing conclusions._
6 changes: 5 additions & 1 deletion benchmarks/runner.ts
@@ -11,6 +11,7 @@
import { mkdirSync, writeFileSync, existsSync } from 'fs';
import { join, dirname } from 'path';
import { runSuiteB } from './suites/suite-b-token-efficiency.js';
import { runSuiteC } from './suites/suite-c-precision-noise.js';
import type { RunResult, SuiteResult, SuiteId } from './types.js';

const RESULTS_DIR = join(import.meta.dir, 'results');
@@ -38,8 +39,11 @@ async function dispatchSuite(suite: SuiteId, project?: string): Promise<SuiteRes
switch (suite) {
case 'B':
return runSuiteB({ project });
case 'A':
case 'C':
// Suite C builds its own synthetic corpora — the project scope does
// not apply to it.
return runSuiteC();
case 'A':
case 'D':
case 'E':
// Stub — these suites are planned but not implemented in this slice.