
feat: add LongMemEval benchmark script (was missing — published numbers weren't reproducible) #13

Merged
anthroos merged 1 commit into main from feat/public-benchmark-script on May 2, 2026

Conversation


anthroos (Owner) commented on May 2, 2026

Why

docs/benchmark-results.md (already on main) cites R@1 = 0.880, R@10 = 0.986, NDCG@10 = 0.924 and gives users a "Reproduce These Results" section with:

mkdir -p benchmarks/data
curl -L -o benchmarks/data/longmemeval_s_cleaned.json \
  "https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/..."

…but benchmarks/ doesn't exist in the repo. Anyone trying to reproduce ends up with a downloaded dataset and no runner.

This PR commits the script behind the published numbers.

What

  • benchmarks/longmemeval_bench.py — the runner. Three modes:
    • raw — pure vector similarity (baseline, like MemPalace raw)
    • hybrid — vector + BM25 keyword scoring (OpenExp default)
    • full — vector + BM25 + recency (see the score-fusion sketch after this list)
    • Embedding: BAAI/bge-small-en-v1.5 (same as production)
    • In-memory Qdrant per question for clean isolation (see the indexing sketch below)
    • Metrics computed the same way as MemPalace for an apples-to-apples comparison
  • benchmarks/README.md — download URL, dataset format, three example invocations, expected runtime, expected numbers
  • .gitignore — benchmarks/data/ added so the ~277 MB dataset can sit on disk without risk of accidental commit
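
For orientation, here is a rough sketch of how the three modes might combine per-document scores. This is illustrative only: the weights, the recency decay constant, and the simplified BM25 below are assumptions, not values taken from longmemeval_bench.py, and a real implementation would normalize vector and BM25 scores onto a comparable scale before mixing them.

```python
import math
from collections import Counter

# Illustrative weights -- NOT the values used by the actual script.
VEC_WEIGHT, BM25_WEIGHT, RECENCY_WEIGHT = 0.7, 0.3, 0.1
RECENCY_HALF_LIFE_S = 30 * 24 * 3600  # assumed 30-day half-life


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_doc_len,
               k1=1.5, b=0.75):
    """Simplified BM25 over pre-tokenized terms (for illustration only)."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        if term not in tf:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        denom = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_doc_len)
        score += idf * tf[term] * (k1 + 1) / denom
    return score


def fuse(vector_sim, bm25, age_seconds, mode):
    """Combine one document's scores according to the benchmark mode."""
    if mode == "raw":                 # pure vector similarity
        return vector_sim
    score = VEC_WEIGHT * vector_sim + BM25_WEIGHT * bm25
    if mode == "full":                # add a recency boost that decays with age
        score += RECENCY_WEIGHT * 0.5 ** (age_seconds / RECENCY_HALF_LIFE_S)
    return score
```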

No new dependencies. Uses fastembed and qdrant_client already in the project.
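
As a minimal sketch of the "fresh in-memory Qdrant per question" setup using those two libraries: the function name, the collection name, and the assumption that each question's haystack sessions arrive as plain strings are mine for illustration, not the script's actual structure.

```python
from fastembed import TextEmbedding
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# bge-small-en-v1.5 produces 384-dimensional embeddings.
embedder = TextEmbedding("BAAI/bge-small-en-v1.5")


def retrieve_for_question(sessions: list[str], query: str, top_k: int = 10):
    """Index one question's haystack in a fresh in-memory collection,
    then return the top-k hits by cosine similarity (raw mode)."""
    client = QdrantClient(":memory:")  # new instance per question = clean isolation
    client.create_collection(
        collection_name="haystack",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )
    vectors = list(embedder.embed(sessions))
    client.upsert(
        collection_name="haystack",
        points=[
            PointStruct(id=i, vector=vec.tolist(), payload={"text": text})
            for i, (vec, text) in enumerate(zip(vectors, sessions))
        ],
    )
    query_vec = next(iter(embedder.embed([query])))
    hits = client.search(
        collection_name="haystack",
        query_vector=query_vec.tolist(),
        limit=top_k,
    )
    return [(hit.payload["text"], hit.score) for hit in hits]
```

The hybrid and full modes would then rescore these candidates (or a larger candidate pool) with a fusion along the lines sketched above.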

Test plan

  • python benchmarks/longmemeval_bench.py benchmarks/data/longmemeval_s_cleaned.json --mode hybrid --limit 20 — should run end-to-end on 20 questions in a few minutes
  • Full 500-question run in hybrid mode reproduces the numbers in docs/benchmark-results.md (R@1 ≈ 0.880, R@10 ≈ 0.986, NDCG@10 ≈ 0.924)
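
For reference, Recall@k and NDCG@k over a ranked result list are computed along these lines. This is a generic sketch with binary relevance, not code lifted from the script; `ranked_ids` is assumed to be already sorted by descending score.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that show up in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG with binary relevance: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)          # rank is 0-based, hence +2
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

The reported R@1, R@10, and NDCG@10 would then be the mean of these per-question values over the 500 questions.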

docs/benchmark-results.md ships in the repo and tells readers to
"Reproduce These Results" with `curl -L -o benchmarks/data/...`,
but the benchmarks/ directory and the runner script were never
in the repo — the published numbers (R@1=0.880, R@10=0.986,
NDCG@10=0.924) weren't reproducible.

This adds:

- benchmarks/longmemeval_bench.py — the runner. Three modes (raw,
  hybrid, full). Embedding: BAAI/bge-small-en-v1.5 (production
  model). Qdrant in-memory per question. Same metrics as MemPalace
  for apples-to-apples comparison.
- benchmarks/README.md — download + run instructions.
- .gitignore — adds benchmarks/data/ so the ~277 MB dataset can
  sit on disk without risk of accidental commit.

No external dependencies beyond what the project already uses
(fastembed, qdrant_client).
anthroos merged commit 5d166e5 into main on May 2, 2026
3 checks passed
anthroos deleted the feat/public-benchmark-script branch on May 2, 2026 at 03:43