
feat(benchmarks): add Suite C precision-under-noise benchmark#64

Merged
edheltzel merged 2 commits into main from feat/47-suite-c
Jun 11, 2026

@edheltzel

Owner

Closes #47

What

Adds Suite C to the benchmark harness — the final item of the MemPalace roadmap. It measures whether search() retrieves the right high-signal memory when the database contains many irrelevant records.

  • Corpus ladder: 100 / 1,000 / 10,000 / 100,000 records, built in a temporary DB via the real initDb() schema and the real src/lib/memory.ts write paths (FTS triggers fire as in production). The user's real DB is never touched; the env override is saved and restored.
  • Deterministic fixtures: seeded PRNG (mulberry32, seed 47). Same seed + size → byte-identical corpus content; tests assert it.
  • Ground truth: fixed target records labeled with table / project / provenance (#42, Add Record Provenance across memory records), constant across sizes. Noise comes in three labeled roles: near-duplicates of targets, entity-name collisions, and low-signal filler, spread across all five searchable tables.
  • Query set: the four categories from the issue — exact project/name lookup, paraphrased decision lookup, learning/problem lookup, and noisy ambiguous queries with explicit collision labels (name / project / topic) so failures are attributable.
  • Metrics: P@5, R@5, MRR@5 per corpus size + breakdowns by query category, ground-truth table, and provenance. Latency p50/p95 with the warmup/repeat protocol documented in the caveats, including whether the embedding service was available (FTS5 keyword path only in this baseline).
  • Wiring: dispatchSuite case C, recall benchmark list shows C as built, README methodology section added.
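
The determinism claim in the fixtures bullet can be illustrated with a minimal sketch. It assumes the commonly circulated mulberry32 formulation; `buildNoise`, the word list, and the role identifiers other than `near_duplicate` are hypothetical and are not taken from suite-c-internals.ts.

```typescript
// Commonly circulated mulberry32: a 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical noise roles; only "near_duplicate" appears verbatim in the PR.
type NoiseRole = "near_duplicate" | "name_collision" | "filler";

interface FixtureRecord {
  id: number;
  role: NoiseRole;
  text: string;
}

// Hypothetical vocabulary, for illustration only.
const WORDS = ["zephyr", "helios", "glacier", "release", "cache", "index"];
const ROLES: NoiseRole[] = ["near_duplicate", "name_collision", "filler"];

// Every random choice flows through one seeded generator, so the same
// (seed, size) pair always yields the same records in the same order.
function buildNoise(seed: number, size: number): FixtureRecord[] {
  const rand = mulberry32(seed);
  const pick = <T>(items: readonly T[]): T =>
    items[Math.floor(rand() * items.length)];
  const records: FixtureRecord[] = [];
  for (let id = 1; id <= size; id++) {
    const role = pick(ROLES);
    const text = [pick(WORDS), pick(WORDS), pick(WORDS)].join(" ");
    records.push({ id, role, text });
  }
  return records;
}

// Two builds from the same seed serialize to identical strings.
const first = JSON.stringify(buildNoise(47, 100));
const second = JSON.stringify(buildNoise(47, 100));
console.log(first === second); // true
```

Comparing a serialized form (or a digest of it) is one way a test can pin "byte-identical corpus content" without storing the corpus itself.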

Baseline (committed)

First honest run, no pass/fail threshold per the issue. Headline: exact-lookup MRR degrades 1.0 → 0 from 100 → 100k records as unmarked near-duplicates crowd originals out of the top-5 (LoA near-dups double-count terms across title+extract; message near-dups win bm25 length normalization). Latency p95 grows 0.5ms → 36ms. This is the measurable gap dedup lineage (#45) exists to close — a post-dedup run can be compared directly against this JSONL.

Tests

  • Fixture determinism (spec-level and seeded-DB content digest)
  • Ground-truth label consistency (keys resolve; project-scoped queries only expect in-project records; collision labels present; targets constant across sizes)
  • Metric calculation (P@k, R@k, RR cutoff behavior, nearest-rank percentile)
  • Report generation (sample set shape, ratio bounds, caveat protocol text, markdown rendering)
  • Updated the runner all-suites test for C being built (env-capped sizes keep CI fast)
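
As a rough sketch of what the metric-calculation tests exercise, the functions below use the textbook definitions of P@k, R@k, reciprocal rank with a cutoff, and the nearest-rank percentile. The function names and the `Ref` key format are assumptions; the harness's own implementation may differ in details such as how P@k treats result lists shorter than k.

```typescript
// Hypothetical reference key, e.g. "decisions:12".
type Ref = string;

// Fraction of the top-k slots that hold a relevant record (divides by k).
function precisionAtK(retrieved: Ref[], relevant: Set<Ref>, k: number): number {
  const hits = retrieved.slice(0, k).filter((r) => relevant.has(r)).length;
  return hits / k;
}

// Fraction of the relevant records that appear in the top k.
function recallAtK(retrieved: Ref[], relevant: Set<Ref>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = retrieved.slice(0, k).filter((r) => relevant.has(r)).length;
  return hits / relevant.size;
}

// 1 / rank of the first relevant record, or 0 if none appears within the cutoff.
function reciprocalRankAtK(retrieved: Ref[], relevant: Set<Ref>, k: number): number {
  const idx = retrieved.slice(0, k).findIndex((r) => relevant.has(r));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Nearest-rank percentile: the value at rank ceil(p/100 * n) in sorted order.
function nearestRankPercentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((x, y) => x - y);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

const relevant = new Set<Ref>(["a", "b", "t"]);
const retrieved: Ref[] = ["a", "x", "b", "y", "z", "t"];
console.log(precisionAtK(retrieved, relevant, 5)); // 0.4
console.log(reciprocalRankAtK(retrieved, relevant, 5)); // 1

// A target that is still retrievable at rank 9 contributes 0 at cutoff 5.
const crowded: Ref[] = ["n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "target"];
console.log(reciprocalRankAtK(crowded, new Set(["target"]), 5)); // 0

console.log(nearestRankPercentile([5, 1, 3, 2, 4], 50)); // 3
```

The cutoff case is the one that matters for the baseline: a target pushed below rank 5 scores exactly 0 rather than a small positive value.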

Gate

  • bun run lint clean
  • bun test 652 pass / 0 fail
  • recall benchmark run C end-to-end produces the committed baseline (full ladder, ~5s)

Builds seeded deterministic corpora (100/1k/10k/100k records) in a temp DB
using the real schema and write paths, then measures the real FTS5 search()
path against a ground-truth-labeled query set (exact lookup, paraphrase,
problem lookup, ambiguous-with-collisions). Reports P@5, R@5, MRR@5, and
latency p50/p95 per corpus size, with breakdowns by query category, target
table, and provenance (#42). Near-duplicate noise exercises the gap dedup
lineage (#45) exists to close; dedup is intentionally not run so the
baseline records the unmitigated behavior.

Baseline-first per the issue: no pass/fail threshold. Env overrides
RECALL_BENCH_C_SIZES / RECALL_BENCH_C_REPEATS keep CI fast.
First honest baseline: exact-lookup MRR degrades 1.0 -> 0 from 100 to 100k
records as unmarked near-duplicates crowd originals out of the top-5;
latency p95 grows 0.5ms -> 36ms. Future regression gating diffs against
this JSONL.
@edheltzel

Owner Author

Review — PR #64: Suite C precision-under-noise benchmark

Reviewed head 575a60e against issue #47 and the benchmarks/README.md methodology, via the pr-review-toolkit (code-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer) plus independent hands-on verification in a clean worktree. (Posted as a comment — GitHub blocks a formal approval from the PR author's account.)

Independent verification (all passed)

  1. Deterministic reproduction. Ran recall benchmark run C end-to-end on the branch (full ladder, ~5s). All 76 non-latency samples byte-identical to the committed baseline JSONL; fixture counts match (100,000 records, 14 queries, 12 seeded targets at 100k).
  2. loa → loa_entries mapping. search() reports LoA rows under the logical name 'loa' (SEARCH_TABLES, src/lib/memory.ts:261; the loa branch joins loa_fts → physical loa_entries but pushes table: 'loa'). The harness maps only the expected side via searchTableName() (suite-c-internals.ts:624), leaving retrieved refs as search() emits them. This is the correct direction and the same rowid space, and the corpus=100 exact-lookup MRR of 1.0 (which includes the LoA-targeted q_exact_helios) confirms it isn't papering over a different mismatch.
  3. MRR collapse mechanism is real. Inspected actual top-5 results on the generated seed-47/100k corpus for all four exact-lookup queries: every top-5 slot is occupied by near_duplicate noise (verbatim target text + suffix), while the true targets remain present and retrievable at ranks 9, 10, 102, and 498 in a deep search. Genuine crowding, not a fixture artifact or ground-truth labeling error. The PR's specific mechanism claims also hold: message-table near-dups win for zephyr/helios (bm25 length normalization), LoA near-dups win for glacier/release (terms counted across title+extract).
  4. DB isolation. Env save/restore is try/finally with correct unset-vs-set handling for both RECALL_DB_PATH and MEM_DB_PATH (suite-c-precision-noise.ts:156-206), pinned by the sentinel test. Empirically: user DB mtime unchanged after a full run; no temp-dir leftovers.
  5. CI. Green on the head commit — gh run list --commit 575a60e shows the CI workflow completed/success for both jobs.
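
The unset-vs-set handling mentioned under DB isolation can be sketched as follows. This is a minimal, synchronous illustration rather than the code at suite-c-precision-noise.ts:156-206: `withEnvOverrides` is a hypothetical helper, and it takes the environment as a plain object so the example stays self-contained.

```typescript
type Env = Record<string, string | undefined>;

// Applies overrides, runs fn, then restores every touched key in a finally
// block. Keys that were originally unset are deleted again rather than being
// left behind as the string "undefined" or an empty value.
function withEnvOverrides<T>(
  env: Env,
  overrides: Record<string, string>,
  fn: () => T,
): T {
  const saved: Array<[string, string | undefined]> = [];
  for (const key of Object.keys(overrides)) {
    saved.push([key, env[key]]);
    env[key] = overrides[key];
  }
  try {
    return fn();
  } finally {
    for (const [key, previous] of saved) {
      if (previous === undefined) {
        delete env[key];
      } else {
        env[key] = previous;
      }
    }
  }
}

// One variable starts set and the other unset, mirroring the two cases.
const env: Env = { RECALL_DB_PATH: "/home/user/real.db" };
withEnvOverrides(
  env,
  { RECALL_DB_PATH: "/tmp/bench.db", MEM_DB_PATH: "/tmp/bench.db" },
  () => {
    console.log(env.RECALL_DB_PATH); // "/tmp/bench.db"
  },
);
console.log(env.RECALL_DB_PATH); // "/home/user/real.db"
console.log("MEM_DB_PATH" in env); // false
```

In a real runner the first argument would be `process.env`, and an asynchronous benchmark body would need an awaited variant of the same try/finally shape.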

Issue #47 acceptance criteria

All met: corpus ladder 100/1k/10k/100k; seeded deterministic fixtures (asserted by tests and verified end-to-end above); ground truth labeled with table/project/provenance; all four query categories with typed name/project/topic collision labels on ambiguous queries; P@5/R@5/MRR + latency p50/p95 with the warmup/repeat protocol and embedding availability documented in the caveats; table and provenance breakdowns; results in benchmarks/results/; honest baseline with no invented threshold; tests for all four required areas; README methodology section (every number verified against the code). Suite B architectural patterns are followed faithfully (internals/driver split, shared SuiteResult/MetricSample types, no composite scores, surgical runner/benchmark list wiring).

Findings — none blocking; recommended follow-ups

  1. Harness can record silent zeros (most substantive). runQuery never consults getLastSearchErrors() (suite-c-precision-noise.ts:76), and search() swallows per-table errors into that channel (src/lib/memory.ts:389-394). If FTS triggers regressed or a future query hit FTS5 syntax errors, every metric would read 0 — indistinguishable from real degradation, and the test suite would stay green because no test asserts a nonzero score. Three cheap guards close this: throw if getLastSearchErrors() is non-empty after each query; reconcile seeded row counts against spec.size after seedFixture (and make the default: continue at suite-c-internals.ts:589 throw); add one test anchor like r_at_5 > 0 at corpus=100 (deterministically 0.8095 per this baseline). Recommend a follow-up issue — this belongs with the regression-gating work the issue already defers, but should land before any baseline re-record.
  2. Circular test. "Seeded targets resolve with matching table/project/provenance" (tests/benchmarks/suite-c.test.ts:239-255) compares spec values against echoes of the same spec values (suite-c-internals.ts:593). SELECTing project/provenance back from the declared table would make it the end-to-end check it reads as.
  3. Env override validation. An invalid RECALL_BENCH_C_SIZES (e.g. a typo) silently falls back to the full default ladder including the 100k build (suite-c-precision-noise.ts:54-59); an explicitly-set override that can't be honored should throw. Also untested: nothing asserts the override actually takes effect, and suite-b.test.ts:252-253 deletes rather than restores the env vars.
  4. Docs nits. benchmarks/README.md:3 references .atlas/plans/2026-04-17-mempalace-research-borrow-list.md — a gitignored, nonexistent-in-repo path carried forward on a line this PR touches. The r_at_5_table_*/r_at_5_prov_* breakdowns are per-(query, expected-record) pairs, not per-unique-record — the comment at suite-c-precision-noise.ts:123-125 reads as the latter. The file header's "Regression gating compares future runs…" (suite-c-precision-noise.ts:15) states future work in present tense; the caveat's "can diff" phrasing is the accurate one.
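
Two of the recommended guards (findings 1 and 3) could look roughly like this. Everything here is an assumption made for illustration: the comma-separated format of RECALL_BENCH_C_SIZES, the shape of the value returned by getLastSearchErrors() (modeled as a plain array), and the helper names `parseSizesOverride` and `assertNoSearchErrors`.

```typescript
// Default ladder from the PR description.
const DEFAULT_SIZES = [100, 1000, 10000, 100000];

// Strict parsing: an unset variable means "use the defaults", but a value
// that is set and cannot be honored throws instead of silently falling back.
// Assumes a comma-separated list of positive integers.
function parseSizesOverride(raw: string | undefined): number[] {
  if (raw === undefined) return DEFAULT_SIZES.slice();
  return raw.split(",").map((part) => {
    const token = part.trim();
    if (!/^\d+$/.test(token) || Number(token) === 0) {
      throw new Error(
        `RECALL_BENCH_C_SIZES: invalid size "${token}" in "${raw}"`,
      );
    }
    return Number(token);
  });
}

// Fails loudly when search() reported per-table errors for a query, so an
// all-zero metric cannot be mistaken for real retrieval degradation.
// `errors` stands in for whatever getLastSearchErrors() returns.
function assertNoSearchErrors(errors: readonly unknown[], queryId: string): void {
  if (errors.length > 0) {
    throw new Error(
      `search() reported ${errors.length} error(s) for query ${queryId}`,
    );
  }
}

console.log(parseSizesOverride("100, 1000")); // [ 100, 1000 ]
try {
  parseSizesOverride("100,1ooo"); // typo: letters instead of zeros
} catch (err) {
  console.log((err as Error).message);
}
```

Throwing in both places trades a little convenience for the property the review asks for: a misconfigured or broken run stops instead of recording plausible-looking zeros.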

Strengths

Benchmarks the real write/search paths instead of a strawman; determinism holds end-to-end across machines-and-runs (verified, not just asserted); the baseline is honest — paraphrase queries scoring ~0 on keyword search and the 100k collapse are reported, caveated, and explained rather than tuned away; isolation of the user DB is exactly right and test-pinned; this is precisely the measurable gap that dedup lineage (#45) exists to close, now with a committed JSONL to diff against.

Verdict: APPROVE

@edheltzel edheltzel merged commit aad7823 into main Jun 11, 2026
2 checks passed
@edheltzel edheltzel deleted the feat/47-suite-c branch June 11, 2026 20:06
edheltzel added a commit that referenced this pull request Jun 11, 2026
Roadmap complete (#40-#48 all closed, PRs #54-#62, #64 merged). Tracker
and handoff moved to .agents/atlas/plans/archive/ per the completed-plans
rule; archive/ is gitignored (local artifacts), full content remains in
git history.

Development

Successfully merging this pull request may close these issues.

Add Suite C benchmark for precision under noise
