feat(benchmarks): add Suite C precision-under-noise benchmark by edheltzel · Pull Request #64 · edheltzel/Recall

edheltzel · 2026-06-11T09:39:05Z

Closes #47

What

Adds Suite C to the benchmark harness — the final item of the MemPalace roadmap. It measures whether search() retrieves the right high-signal memory when the database contains many irrelevant records.

Corpus ladder: 100 / 1,000 / 10,000 / 100,000 records, built in a temporary DB via the real initDb() schema and the real src/lib/memory.ts write paths (FTS triggers fire as in production). The user's real DB is never touched; the env override is saved and restored.
Deterministic fixtures: seeded PRNG (mulberry32, seed 47). Same seed + size → byte-identical corpus content; tests assert it.
Ground truth: fixed target records labeled with table / project / provenance (#42), constant across sizes. Noise comes in three labeled roles: near-duplicates of targets, entity-name collisions, and low-signal filler, spread across all five searchable tables.
Query set: the four categories from the issue — exact project/name lookup, paraphrased decision lookup, learning/problem lookup, and noisy ambiguous queries with explicit collision labels (name / project / topic) so failures are attributable.
Metrics: P@5, R@5, MRR@5 per corpus size + breakdowns by query category, ground-truth table, and provenance. Latency p50/p95 with the warmup/repeat protocol documented in the caveats, including whether the embedding service was available (FTS5 keyword path only in this baseline).
Wiring: dispatchSuite case C, recall benchmark list shows C as built, README methodology section added.
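To illustrate the determinism claim, here is a minimal sketch of a mulberry32-seeded generator. The `mulberry32` body is the widely circulated public-domain form; `pick`, `makeCorpus`, and the word list are hypothetical names used only for illustration and are not taken from the Suite C harness.

```typescript
// mulberry32: a small 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: choose one item using the seeded generator only.
function pick<T>(rand: () => number, items: readonly T[]): T {
  return items[Math.floor(rand() * items.length)] as T;
}

// Hypothetical filler vocabulary; the real fixtures are not shown in this PR text.
const WORDS = ["cache", "deploy", "index", "schema", "retry", "token"] as const;

// Hypothetical corpus builder: every random choice flows through `rand`,
// so the output depends only on (seed, size).
function makeCorpus(seed: number, size: number): string[] {
  const rand = mulberry32(seed);
  const records: string[] = [];
  for (let i = 0; i < size; i++) {
    records.push(`${pick(rand, WORDS)} ${pick(rand, WORDS)} ${pick(rand, WORDS)}`);
  }
  return records;
}
```

Because no call reaches `Math.random()` or the clock, two builds with the same seed and size produce identical text, which is the property a content-digest test can pin.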

Baseline (committed)

First honest run, no pass/fail threshold per the issue. Headline: exact-lookup MRR degrades 1.0 → 0 from 100 → 100k records as unmarked near-duplicates crowd originals out of the top-5 (LoA near-dups double-count terms across title+extract; message near-dups win bm25 length normalization). Latency p95 grows 0.5ms → 36ms. This is the measurable gap dedup lineage (#45) exists to close — a post-dedup run can be compared directly against this JSONL.

Tests

Fixture determinism (spec-level and seeded-DB content digest)
Ground-truth label consistency (keys resolve; project-scoped queries only expect in-project records; collision labels present; targets constant across sizes)
Metric calculation (P@k, R@k, RR cutoff behavior, nearest-rank percentile)
Report generation (sample set shape, ratio bounds, caveat protocol text, markdown rendering)
Updated the runner all-suites test for C being built (env-capped sizes keep CI fast)
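For readers unfamiliar with the metrics under test, the following is a minimal sketch of P@k, R@k, and reciprocal rank with a cutoff. It is not the harness code: the `Ref` shape and function names are assumptions, and it assumes P@k divides by k rather than by the number of results returned.

```typescript
// Hypothetical reference shape: a (table, id) pair identifies one record.
type Ref = { table: string; id: number };
const refKey = (r: Ref): string => `${r.table}:${r.id}`;

// Distinct relevant records found within the first k retrieved results.
function hitsAtK(retrieved: Ref[], relevant: Ref[], k: number): Set<string> {
  const relevantKeys = new Set(relevant.map(refKey));
  const hits = new Set<string>();
  for (const r of retrieved.slice(0, k)) {
    const key = refKey(r);
    if (relevantKeys.has(key)) hits.add(key);
  }
  return hits;
}

// P@k: fraction of the k slots occupied by relevant records (assumed denominator: k).
function precisionAtK(retrieved: Ref[], relevant: Ref[], k: number): number {
  return k > 0 ? hitsAtK(retrieved, relevant, k).size / k : 0;
}

// R@k: fraction of the relevant records that appear in the top k.
function recallAtK(retrieved: Ref[], relevant: Ref[], k: number): number {
  const total = new Set(relevant.map(refKey)).size;
  return total === 0 ? 0 : hitsAtK(retrieved, relevant, k).size / total;
}

// RR@k: 1 / rank of the first relevant result, or 0 if none appears within the cutoff.
function reciprocalRankAtK(retrieved: Ref[], relevant: Ref[], k: number): number {
  const relevantKeys = new Set(relevant.map(refKey));
  const limit = Math.min(k, retrieved.length);
  for (let i = 0; i < limit; i++) {
    if (relevantKeys.has(refKey(retrieved[i] as Ref))) return 1 / (i + 1);
  }
  return 0;
}
```

MRR@5 is then the mean of `reciprocalRankAtK(..., 5)` over the query set. The cutoff is what drives a score to 0 when near-duplicates fill all five slots, even though the target is still retrievable further down.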

Gate

bun run lint clean
bun test 652 pass / 0 fail
recall benchmark run C end-to-end produces the committed baseline (full ladder, ~5s)

Builds seeded deterministic corpora (100/1k/10k/100k records) in a temp DB using the real schema and write paths, then measures the real FTS5 search() path against a ground-truth-labeled query set (exact lookup, paraphrase, problem lookup, ambiguous-with-collisions). Reports P@5, R@5, MRR@5, and latency p50/p95 per corpus size, with breakdowns by query category, target table, and provenance (#42). Near-duplicate noise exercises the gap dedup lineage (#45) exists to close; dedup is intentionally not run so the baseline records the unmitigated behavior. Baseline-first per the issue: no pass/fail threshold. Env overrides RECALL_BENCH_C_SIZES / RECALL_BENCH_C_REPEATS keep CI fast.

First honest baseline: exact-lookup MRR degrades 1.0 -> 0 from 100 to 100k records as unmarked near-duplicates crowd originals out of the top-5; latency p95 grows 0.5ms -> 36ms. Future regression gating diffs against this JSONL.
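As a rough illustration of how latency p50/p95 figures like these can be produced, here is a sketch of a warmup/repeat timing loop feeding a nearest-rank percentile. The function names are hypothetical, and the warmup and repeat counts are left as parameters because the actual protocol values live in the benchmark caveats, not in this text.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. No interpolation is performed.
function nearestRankPercentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing so integer inputs stay exact.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1] as number;
}

// Hypothetical timing loop: untimed warmup calls, then timed repeats.
function measureLatencies(fn: () => void, warmup: number, repeats: number): number[] {
  for (let i = 0; i < warmup; i++) fn();
  const durationsMs: number[] = [];
  for (let i = 0; i < repeats; i++) {
    const start = performance.now();
    fn();
    durationsMs.push(performance.now() - start);
  }
  return durationsMs;
}
```

A call such as `nearestRankPercentile(measureLatencies(run, 3, 20), 95)` would yield a p95 in milliseconds. Latency samples vary by machine, which is why a byte-identical comparison against a baseline needs to exclude them.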

edheltzel · 2026-06-11T10:10:53Z

Review — PR #64: Suite C precision-under-noise benchmark

Reviewed head 575a60e against issue #47 and the benchmarks/README.md methodology, via the pr-review-toolkit (code-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer) plus independent hands-on verification in a clean worktree. (Posted as a comment — GitHub blocks a formal approval from the PR author's account.)

Independent verification (all passed)

Deterministic reproduction. Ran recall benchmark run C end-to-end on the branch (full ladder, ~5s). All 76 non-latency samples byte-identical to the committed baseline JSONL; fixture counts match (100,000 records, 14 queries, 12 seeded targets at 100k).
loa ↔ loa_entries mapping. search() reports LoA rows under the logical name 'loa' (SEARCH_TABLES, src/lib/memory.ts:261; the loa branch joins loa_fts → physical loa_entries but pushes table: 'loa'). The harness maps only the expected side via searchTableName() (suite-c-internals.ts:624), leaving retrieved refs as search() emits them — correct direction, same rowid space, and the corpus=100 exact-lookup MRR of 1.0 (which includes the LoA-targeted q_exact_helios) confirms it isn't papering over a different mismatch.
MRR collapse mechanism is real. Inspected actual top-5 results on the generated seed-47/100k corpus for all four exact-lookup queries: every top-5 slot is occupied by near_duplicate noise (verbatim target text + suffix), while the true targets remain present and retrievable at ranks 9, 10, 102, and 498 in a deep search. Genuine crowding, not a fixture artifact or ground-truth labeling error. The PR's specific mechanism claims also hold: message-table near-dups win for zephyr/helios (bm25 length normalization), LoA near-dups win for glacier/release (terms counted across title+extract).
DB isolation. Env save/restore is try/finally with correct unset-vs-set handling for both RECALL_DB_PATH and MEM_DB_PATH (suite-c-precision-noise.ts:156-206), pinned by the sentinel test. Empirically: user DB mtime unchanged after a full run; no temp-dir leftovers.
CI. Green on the head commit — gh run list --commit 575a60e shows the CI workflow completed/success for both jobs.
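The unset-vs-set distinction in the DB isolation check can be sketched as follows. This is an illustrative pattern, not the code at the cited lines; `withEnvOverride` is a hypothetical name, and the sketch is synchronous for brevity.

```typescript
// Run `fn` with one environment variable overridden, then restore the prior
// state exactly: a variable that was unset before must be unset afterwards.
function withEnvOverride<T>(name: string, value: string, fn: () => T): T {
  const previous = process.env[name]; // undefined means "was not set"
  process.env[name] = value;
  try {
    return fn();
  } finally {
    if (previous === undefined) {
      // Assigning undefined would store the string "undefined" in Node,
      // so the variable has to be deleted instead.
      delete process.env[name];
    } else {
      process.env[name] = previous;
    }
  }
}
```

Putting the restore in `finally` means the caller's environment is recovered even when the wrapped run throws, which is the situation a sentinel test is meant to catch.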

Issue #47 acceptance criteria

All met: corpus ladder 100/1k/10k/100k; seeded deterministic fixtures (asserted by tests and verified end-to-end above); ground truth labeled with table/project/provenance; all four query categories with typed name/project/topic collision labels on ambiguous queries; P@5/R@5/MRR + latency p50/p95 with the warmup/repeat protocol and embedding availability documented in the caveats; table and provenance breakdowns; results in benchmarks/results/; honest baseline with no invented threshold; tests for all four required areas; README methodology section (every number verified against the code). Suite B architectural patterns are followed faithfully (internals/driver split, shared SuiteResult/MetricSample types, no composite scores, surgical runner/benchmark list wiring).

Findings — none blocking; recommended follow-ups

Harness can record silent zeros (most substantive). runQuery never consults getLastSearchErrors() (suite-c-precision-noise.ts:76), and search() swallows per-table errors into that channel (src/lib/memory.ts:389-394). If FTS triggers regressed or a future query hit FTS5 syntax errors, every metric would read 0 — indistinguishable from real degradation, and the test suite would stay green because no test asserts a nonzero score. Three cheap guards close this: throw if getLastSearchErrors() is non-empty after each query; reconcile seeded row counts against spec.size after seedFixture (and make the default: continue at suite-c-internals.ts:589 throw); add one test anchor like r_at_5 > 0 at corpus=100 (deterministically 0.8095 per this baseline). Recommend a follow-up issue — this belongs with the regression-gating work the issue already defers, but should land before any baseline re-record.
Circular test. "Seeded targets resolve with matching table/project/provenance" (tests/benchmarks/suite-c.test.ts:239-255) compares spec values against echoes of the same spec values (suite-c-internals.ts:593). SELECTing project/provenance back from the declared table would make it the end-to-end check it reads as.
Env override validation. An invalid RECALL_BENCH_C_SIZES (e.g. a typo) silently falls back to the full default ladder including the 100k build (suite-c-precision-noise.ts:54-59); an explicitly-set override that can't be honored should throw. Also untested: nothing asserts the override actually takes effect, and suite-b.test.ts:252-253 deletes rather than restores the env vars.
Docs nits. benchmarks/README.md:3 references .atlas/plans/2026-04-17-mempalace-research-borrow-list.md — a gitignored, nonexistent-in-repo path carried forward on a line this PR touches. The r_at_5_table_*/r_at_5_prov_* breakdowns are per-(query, expected-record) pairs, not per-unique-record — the comment at suite-c-precision-noise.ts:123-125 reads as the latter. The file header's "Regression gating compares future runs…" (suite-c-precision-noise.ts:15) states future work in present tense; the caveat's "can diff" phrasing is the accurate one.
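The guards recommended in the findings above could look roughly like the sketch below. Everything here is an assumption for illustration: the `SearchError` shape, the function names, and the comma-separated format for the size override are not confirmed by the PR, and the search and error accessors are passed in as parameters rather than imported from the real module.

```typescript
// Assumed error shape; the real error channel may differ.
type SearchError = { table: string; message: string };

// Guard 1: treat any swallowed per-table error as a hard failure, so an
// all-zero metric cannot be mistaken for genuine degradation.
function runQueryStrict<R>(
  query: string,
  search: (q: string) => R[],
  getErrors: () => SearchError[],
): R[] {
  const results = search(query);
  const errors = getErrors();
  if (errors.length > 0) {
    const detail = errors.map((e) => `${e.table}: ${e.message}`).join("; ");
    throw new Error(`search errors for "${query}": ${detail}`);
  }
  return results;
}

// Guard 2: reconcile the number of rows actually seeded against the spec.
function assertSeededCount(expected: number, actual: number): void {
  if (expected !== actual) {
    throw new Error(`seeded ${actual} rows, expected ${expected}`);
  }
}

// Guard 3: an explicitly set override that cannot be honored throws instead
// of falling back to the full ladder. Assumes a comma-separated list.
function parseSizesOverride(raw: string | undefined, defaults: number[]): number[] {
  if (raw === undefined) return defaults;
  const sizes = raw.split(",").map((part) => Number(part.trim()));
  if (sizes.length === 0 || sizes.some((n) => !Number.isInteger(n) || n <= 0)) {
    throw new Error(`invalid size override: "${raw}"`);
  }
  return sizes;
}
```

With guards of this kind in place, a test anchor asserting a nonzero score at the smallest corpus size would fail loudly if the search path ever broke, rather than leaving the suite green.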

Strengths

Benchmarks the real write/search paths instead of a strawman; determinism holds end-to-end across machines and runs (verified, not just asserted); the baseline is honest: paraphrase queries scoring ~0 on keyword search and the 100k collapse are reported, caveated, and explained rather than tuned away; isolation of the user DB is exactly right and test-pinned; this is precisely the measurable gap that dedup lineage (#45) exists to close, now with a committed JSONL to diff against.

Verdict: APPROVE

Roadmap complete (#40-#48 all closed, PRs #54-#62, #64 merged). Tracker and handoff moved to .agents/atlas/plans/archive/ per the completed-plans rule; archive/ is gitignored (local artifacts), full content remains in git history.

edheltzel added 2 commits June 11, 2026 05:38

edheltzel merged commit aad7823 into main Jun 11, 2026
2 checks passed

edheltzel deleted the feat/47-suite-c branch June 11, 2026 20:06

edheltzel mentioned this pull request Jun 11, 2026

chore: archive completed MemPalace plan docs #65

Merged

edheltzel mentioned this pull request Jun 11, 2026

benchmarks: Suite C silent-zero hardening — required before any baseline re-record #70

Open
