
feat: add LongMemEval benchmark script (was missing — published numbers weren't reproducible) #13

Merged
anthroos merged 1 commit into main from feat/public-benchmark-script on May 2, 2026

Conversation


anthroos (Owner) commented on May 2, 2026

Why

docs/benchmark-results.md (already on main) cites R@1 = 0.880, R@10 = 0.986, NDCG@10 = 0.924 and gives users a "Reproduce These Results" section with:

mkdir -p benchmarks/data
curl -L -o benchmarks/data/longmemeval_s_cleaned.json \
  "https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/..."

…but benchmarks/ doesn't exist in the repo. Anyone trying to reproduce ends up with a downloaded dataset and no runner.

This PR commits the script behind the published numbers.

What

  • benchmarks/longmemeval_bench.py — the runner. Three modes:
    • raw — pure vector similarity (baseline, like MemPalace raw)
    • hybrid — vector + BM25 keyword scoring (OpenExp default)
    • full — vector + BM25 + recency (see the score-fusion sketch after this list)
    • Embedding: BAAI/bge-small-en-v1.5 (same as production)
    • In-memory Qdrant per question for clean isolation (see the indexing sketch below)
    • Metrics computed the same way as MemPalace for an apples-to-apples comparison
  • benchmarks/README.md — download URL, dataset format, three example invocations, expected runtime, expected numbers
  • .gitignore — benchmarks/data/ added so the ~277 MB dataset can sit on disk without risk of accidental commit
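
For orientation, here is a rough sketch of how the three modes might combine per-document scores. This is illustrative only: the weights, the recency decay constant, and the simplified BM25 below are assumptions, not values taken from longmemeval_bench.py, and a real implementation would normalize vector and BM25 scores onto a comparable scale before mixing them.

```python
import math
from collections import Counter

# Illustrative weights -- NOT the values used by the actual script.
VEC_WEIGHT, BM25_WEIGHT, RECENCY_WEIGHT = 0.7, 0.3, 0.1
RECENCY_HALF_LIFE_S = 30 * 24 * 3600  # assumed 30-day half-life


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_doc_len,
               k1=1.5, b=0.75):
    """Simplified BM25 over pre-tokenized terms (for illustration only)."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        if term not in tf:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        denom = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_doc_len)
        score += idf * tf[term] * (k1 + 1) / denom
    return score


def fuse(vector_sim, bm25, age_seconds, mode):
    """Combine one document's scores according to the benchmark mode."""
    if mode == "raw":                 # pure vector similarity
        return vector_sim
    score = VEC_WEIGHT * vector_sim + BM25_WEIGHT * bm25
    if mode == "full":                # add a recency boost that decays with age
        score += RECENCY_WEIGHT * 0.5 ** (age_seconds / RECENCY_HALF_LIFE_S)
    return score
```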

No new dependencies. Uses fastembed and qdrant_client already in the project.
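
As a minimal sketch of the "fresh in-memory Qdrant per question" setup using those two libraries: the function name, the collection name, and the assumption that each question's haystack sessions arrive as plain strings are mine for illustration, not the script's actual structure.

```python
from fastembed import TextEmbedding
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# bge-small-en-v1.5 produces 384-dimensional embeddings.
embedder = TextEmbedding("BAAI/bge-small-en-v1.5")


def retrieve_for_question(sessions: list[str], query: str, top_k: int = 10):
    """Index one question's haystack in a fresh in-memory collection,
    then return the top-k hits by cosine similarity (raw mode)."""
    client = QdrantClient(":memory:")  # new instance per question = clean isolation
    client.create_collection(
        collection_name="haystack",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )
    vectors = list(embedder.embed(sessions))
    client.upsert(
        collection_name="haystack",
        points=[
            PointStruct(id=i, vector=vec.tolist(), payload={"text": text})
            for i, (vec, text) in enumerate(zip(vectors, sessions))
        ],
    )
    query_vec = next(iter(embedder.embed([query])))
    hits = client.search(
        collection_name="haystack",
        query_vector=query_vec.tolist(),
        limit=top_k,
    )
    return [(hit.payload["text"], hit.score) for hit in hits]
```

The hybrid and full modes would then rescore these candidates (or a larger candidate pool) with a fusion along the lines sketched above.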

Test plan

  • python benchmarks/longmemeval_bench.py benchmarks/data/longmemeval_s_cleaned.json --mode hybrid --limit 20 — should run end-to-end on 20 questions in a few minutes
  • Full 500-question run in hybrid mode reproduces the numbers in docs/benchmark-results.md (R@1 ≈ 0.880, R@10 ≈ 0.986, NDCG@10 ≈ 0.924)
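
For reference, Recall@k and NDCG@k over a ranked result list are computed along these lines. This is a generic sketch with binary relevance, not code lifted from the script; `ranked_ids` is assumed to be already sorted by descending score.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that show up in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG with binary relevance: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)          # rank is 0-based, hence +2
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

The reported R@1, R@10, and NDCG@10 would then be the mean of these per-question values over the 500 questions.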

docs/benchmark-results.md ships in the repo and tells readers to
"Reproduce These Results" with `curl -L -o benchmarks/data/...`,
but the benchmarks/ directory and the runner script were never
in the repo — the published numbers (R@1=0.880, R@10=0.986,
NDCG@10=0.924) weren't reproducible.

This adds:

- benchmarks/longmemeval_bench.py — the runner. Three modes (raw,
  hybrid, full). Embedding: BAAI/bge-small-en-v1.5 (production
  model). Qdrant in-memory per question. Same metrics as MemPalace
  for apples-to-apples comparison.
- benchmarks/README.md — download + run instructions.
- .gitignore — adds benchmarks/data/ so the ~277 MB dataset can
  sit on disk without risk of accidental commit.

No external dependencies beyond what the project already uses
(fastembed, qdrant_client).
anthroos merged commit 5d166e5 into main on May 2, 2026
3 checks passed
anthroos deleted the feat/public-benchmark-script branch on May 2, 2026 at 03:43