MemoryArena Eval

Standalone MemoryArena evaluation runner with offline and official-backend support.

This repository contains only the evaluation surface needed to run MemoryArena experiments. It intentionally excludes the old runtime-learning stack, training runners, non-MemoryArena benchmark integrations, vendored benchmark repos, paper artifacts, and result dumps.

Install

python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install --upgrade -r requirements.txt

Optional official backend setup can also be bootstrapped in one command:

python scripts/setup_backends.py

That command installs the base Python dependencies, clones WebShop and TravelPlanner into .backends/, runs their default setup steps, and writes configs/memoryarena_eval.local.yaml. WebShop setup uses -d all by default for official evaluation; use --webshop_download small only for a faster smoke setup. If you already have the official repos, pass --webshop_path /path/to/WebShop and --travelplanner_path /path/to/TravelPlanner instead of cloning. The bootstrap filters WebShop's optional demo/browser/test/ML dependencies (env, gradio, numpy, pandas, pytest, requests_mock, scikit_learn, selenium, spacy, torch, transformers, train) and stale pins for core packages because MemoryArena uses text observations and keeps its own compatible base dependency set; pass --webshop_install_ml_deps if you want WebShop's full dependency set. It also filters TravelPlanner's demo-agent/provider dependencies by default; pass --travelplanner_install_full_deps if you want its full upstream requirements.

If an existing conda/venv environment already has a broken numpy/pandas ABI pair, repair it before rerunning setup:

python -m pip install --upgrade --force-reinstall "numpy>=1.26,<3" "pandas>=2.2.2,<3"
python -c "import numpy, pandas; print(numpy.__version__, pandas.__version__)"

BrowseComp-Plus is intentionally a second step because preparing its retrieval cache can require a separate embedding/retrieval pipeline. After you have either a cache or retrieval endpoint, update the generated local config:

python scripts/setup_backends.py \
  --configure_only \
  --force \
  --browsecomp_plus_cache_path /path/to/browsecomp_plus_results.jsonl

or:

python scripts/setup_backends.py \
  --configure_only \
  --force \
  --browsecomp_plus_retriever_url http://localhost:9000/search

Without one of those BrowseComp-Plus resources, progressive_search remains on the offline backend.

For the most convenient fixed-corpus BrowseComp-Plus setup, point MemoryArena directly at a vLLM embedding endpoint. The evaluator will call /models to discover the model id, call /embeddings for query vectors, auto-select the matching official Qwen3 BrowseComp-Plus index, and retrieve from the fixed BrowseComp-Plus corpus in-process:

python scripts/setup_backends.py \
  --configure_only \
  --force \
  --backends browsecomp_plus \
  --browsecomp_plus_embedding_base_url http://localhost:8000/v1

To prepare a local BrowseComp-Plus cache in one command, start a vLLM embedding server and let the helper search the official BrowseComp-Plus embedding index:

vllm serve Qwen/Qwen3-Embedding-4B --task embed --port 8000

python scripts/prepare_browsecomp_plus_cache.py \
  --embedding_base_url http://localhost:8000/v1

The user only needs to provide the OpenAI-compatible vLLM endpoint. The script first calls <endpoint>/models to discover the served model id, then calls <endpoint>/embeddings to discover the vector dimension, clones BrowseComp-Plus into .backends/, auto-selects the matching official Qwen3 embedding index (qwen3-embedding-0.6b, qwen3-embedding-4b, or qwen3-embedding-8b), exports MemoryArena progressive_search queries, writes local_cache/browsecomp_plus_progressive_search.jsonl, and writes a local config pointing at the cache. If you pass http://localhost:8000, the helper treats it as http://localhost:8000/v1. By default, it updates the local config and overwrites the generated cache/config paths; pass --no_update_local_config, --no_overwrite_cache, or --no_overwrite_config only when you want to preserve existing files. If your endpoint serves a non-Qwen3 embedding model, pass --index_path for a corpus index built with that exact model, or use --backend official_bm25.

If you prefer the lighter official BM25 retriever:

python scripts/prepare_browsecomp_plus_cache.py \
  --backend official_bm25 \
  --install_bm25_deps

Pyserini still requires Java; if needed, install it with conda install -c conda-forge openjdk=21.

For dynamic query modes, run BM25 as a live local retriever instead of a fixed JSONL cache:

python scripts/browsecomp_bm25_server.py \
  --port 9001 \
  --install_bm25_deps

Then point MemoryArena at the server:

python run/run_memarena.py \
  --config configs/memoryarena_eval.yaml \
  --subset progressive_search \
  --progressive_search_backend browsecomp_plus \
  --browsecomp_plus_retriever_url http://localhost:9001/search \
  --progressive_search_eval_mode agentic_search

To reproduce the older Qdrant embedding retriever path, first build a local BrowseComp-Plus Qdrant index with the same embedding endpoint you will use at query time:

python -m pip install qdrant-client

python scripts/build_browsecomp_qdrant_index.py \
  --qdrant-path data/browsecomp_qdrant \
  --embed-base-url http://localhost:8001/v1 \
  --embed-model Qwen/Qwen3-Embedding-8B

Then serve the index:

python scripts/browsecomp_qdrant_server.py \
  --qdrant-path data/browsecomp_qdrant \
  --embed-base-url http://localhost:8001/v1 \
  --embed-model Qwen/Qwen3-Embedding-8B \
  --port 8099

Run the three-round iterative search diagnostic against that server:

python run/run_memarena.py \
  --config configs/memoryarena_eval.yaml \
  --subset progressive_search \
  --progressive_search_backend browsecomp_plus \
  --browsecomp_plus_retriever_url http://localhost:8099/search \
  --progressive_search_eval_mode diagnostic_retrieval \
  --progressive_search_rounds 3 \
  --max_tokens 32768

To approximate the paper's BM25 RAG baseline, keep the Progressive Web Search tool on the embedding retriever and switch only the episode memory retriever to BM25. The rag baseline disables state-first compilation, but still stores and retrieves prior subtask traces:

python run/run_memarena.py \
  --config configs/memoryarena_eval.yaml \
  --subset progressive_search \
  --progressive_search_backend browsecomp_plus \
  --browsecomp_plus_retriever_url http://localhost:8099/search \
  --progressive_search_eval_mode diagnostic_retrieval \
  --progressive_search_rounds 3 \
  --baseline_mode rag \
  --memory_retriever_backend bm25 \
  --max_tokens 32768

To run the long-context baseline, disable memory retrieval and append every prior subtask trace that fits in the model context window:

python run/run_memarena.py \
  --config configs/memoryarena_eval.yaml \
  --subset progressive_search \
  --progressive_search_backend browsecomp_plus \
  --browsecomp_plus_retriever_url http://localhost:8099/search \
  --progressive_search_eval_mode diagnostic_retrieval \
  --progressive_search_rounds 1 \
  --baseline_mode long_context \
  --max_tokens 32768

To run the Mem0 memory-agent baseline, keep the Progressive Web Search tool on BrowseComp-Plus and point Mem0 at an embedding endpoint. Mem0 defaults to the actor LLM for memory extraction and stores each parallel worker in isolated Qdrant and Mem0-history paths under the run directory:

python run/run_memarena.py \
  --config configs/memoryarena_eval.yaml \
  --subset progressive_search \
  --progressive_search_backend browsecomp_plus \
  --browsecomp_plus_retriever_url http://localhost:8099/search \
  --progressive_search_eval_mode paper_compatible \
  --progressive_search_rounds 1 \
  --baseline_mode mem0 \
  --mem0_embedding_base_url http://localhost:8001/v1 \
  --mem0_embedding_api_key EMPTY \
  --mem0_embedding_model Qwen/Qwen3-Embedding-8B \
  --mem0_vector_dimension 4096 \
  --max_tokens 32768

Pass --mem0_no_infer only for the raw-trace Mem0 ablation; the baseline keeps Mem0 fact extraction enabled.

If you already run a BrowseComp-Plus retriever service, generate the same cache without local indexes:

python scripts/prepare_browsecomp_plus_cache.py \
  --backend http \
  --retriever_url http://localhost:9000/search

Run

python run/run_memarena.py --help
python run/run_memarena.py \
  --config configs/memoryarena_eval.yaml \
  --subset progressive_search \
  --num_tasks 5

For progressive_search, the default remains offline QA. To run against BrowseComp-Plus evidence without starting a separate retriever server, provide the embedding endpoint:

python run/run_memarena.py \
  --config configs/memoryarena_eval.yaml \
  --subset progressive_search \
  --progressive_search_backend browsecomp_plus \
  --browsecomp_plus_embedding_base_url http://localhost:8000/v1

progressive_search_eval_mode defaults to paper_compatible: the evaluator retrieves once with the original subtask query and keeps top_k=5 as the paper-comparable retrieval policy. Runner-generated query refinement is a diagnostic mode, not a strict paper reproduction:

python run/run_memarena.py \
  --config configs/memoryarena_eval.yaml \
  --subset progressive_search \
  --progressive_search_backend browsecomp_plus \
  --browsecomp_plus_embedding_base_url http://localhost:8000/v1 \
  --progressive_search_eval_mode diagnostic_retrieval \
  --progressive_search_rounds 5

samples.jsonl records search_eval_mode, search_queries, search_rounds, and search_paper_compatible so retrieval diagnostics do not get mixed with paper-compatible runs.

You can still provide a retrieval cache or external retriever endpoint:

python run/run_memarena.py \
  --config configs/memoryarena_eval.yaml \
  --subset progressive_search \
  --progressive_search_backend browsecomp_plus \
  --browsecomp_plus_cache_path /path/to/browsecomp_plus_results.jsonl

The default config uses an OpenAI-compatible endpoint:

llm:
  api_key: "EMPTY"
  base_url: "http://localhost:8000/v1"
  model: "Qwen3-4B-Instruct-2507"

Adjust configs/memoryarena_eval.yaml or create a local config file for your model endpoint and official benchmark environment paths.

Scope

Included:

memoryarena_eval/: standalone evaluator package.
run/run_memarena.py: CLI entrypoint.
configs/memoryarena_eval*.yaml: eval configs.
tests/test_memoryarena.py: MemoryArena regression tests.

Not included:

Runtime memory-learning or training code.
Non-MemoryArena benchmark runners.
Vendored benchmark repositories.
Paper docs, experiment notes, and result artifacts.

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
configs		configs
memoryarena_eval		memoryarena_eval
run		run
scripts		scripts
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MemoryArena Eval

Install

Run

Scope

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MemoryArena Eval

Install

Run

Scope

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages