Standalone MemoryArena evaluation runner with offline and official-backend support.
This repository contains only the evaluation surface needed to run MemoryArena experiments. It intentionally excludes the old runtime-learning stack, training runners, non-MemoryArena benchmark integrations, vendored benchmark repos, paper artifacts, and result dumps.
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install --upgrade -r requirements.txtOptional official backend setup can also be bootstrapped in one command:
python scripts/setup_backends.pyThat command installs the base Python dependencies, clones WebShop and
TravelPlanner into .backends/, runs their default setup steps, and writes
configs/memoryarena_eval.local.yaml. WebShop setup uses -d all by default
for official evaluation; use --webshop_download small only for a faster smoke
setup. If you already have the official repos, pass --webshop_path /path/to/WebShop and
--travelplanner_path /path/to/TravelPlanner instead of cloning.
The bootstrap filters WebShop's optional demo/browser/test/ML dependencies
(env, gradio, numpy, pandas, pytest, requests_mock,
scikit_learn, selenium, spacy, torch, transformers, train) and
stale pins for core packages because MemoryArena uses text observations and
keeps its own compatible base dependency set; pass --webshop_install_ml_deps
if you want WebShop's full dependency set.
It also filters TravelPlanner's demo-agent/provider dependencies by default;
pass --travelplanner_install_full_deps if you want its full upstream
requirements.
If an existing conda/venv environment already has a broken numpy/pandas ABI pair, repair it before rerunning setup:
python -m pip install --upgrade --force-reinstall "numpy>=1.26,<3" "pandas>=2.2.2,<3"
python -c "import numpy, pandas; print(numpy.__version__, pandas.__version__)"BrowseComp-Plus is intentionally a second step because preparing its retrieval cache can require a separate embedding/retrieval pipeline. After you have either a cache or retrieval endpoint, update the generated local config:
python scripts/setup_backends.py \
--configure_only \
--force \
--browsecomp_plus_cache_path /path/to/browsecomp_plus_results.jsonlor:
python scripts/setup_backends.py \
--configure_only \
--force \
--browsecomp_plus_retriever_url http://localhost:9000/searchWithout one of those BrowseComp-Plus resources, progressive_search remains on
the offline backend.
For the most convenient fixed-corpus BrowseComp-Plus setup, point MemoryArena
directly at a vLLM embedding endpoint. The evaluator will call /models to
discover the model id, call /embeddings for query vectors, auto-select the
matching official Qwen3 BrowseComp-Plus index, and retrieve from the fixed
BrowseComp-Plus corpus in-process:
python scripts/setup_backends.py \
--configure_only \
--force \
--backends browsecomp_plus \
--browsecomp_plus_embedding_base_url http://localhost:8000/v1To prepare a local BrowseComp-Plus cache in one command, start a vLLM embedding server and let the helper search the official BrowseComp-Plus embedding index:
vllm serve Qwen/Qwen3-Embedding-4B --task embed --port 8000python scripts/prepare_browsecomp_plus_cache.py \
--embedding_base_url http://localhost:8000/v1The user only needs to provide the OpenAI-compatible vLLM endpoint. The script
first calls <endpoint>/models to discover the served model id, then calls
<endpoint>/embeddings to discover the vector dimension, clones BrowseComp-Plus
into .backends/, auto-selects the matching official Qwen3 embedding index
(qwen3-embedding-0.6b, qwen3-embedding-4b, or qwen3-embedding-8b), exports
MemoryArena progressive_search queries, writes
local_cache/browsecomp_plus_progressive_search.jsonl, and writes a local
config pointing at the cache. If you pass http://localhost:8000, the helper
treats it as http://localhost:8000/v1. By default, it updates the local
config and overwrites the generated cache/config paths; pass
--no_update_local_config, --no_overwrite_cache, or --no_overwrite_config
only when you want to preserve existing files. If your endpoint serves a
non-Qwen3 embedding model, pass --index_path for a corpus index built with
that exact model, or use --backend official_bm25.
If you prefer the lighter official BM25 retriever:
python scripts/prepare_browsecomp_plus_cache.py \
--backend official_bm25 \
--install_bm25_depsPyserini still requires Java; if needed, install it with
conda install -c conda-forge openjdk=21.
For dynamic query modes, run BM25 as a live local retriever instead of a fixed JSONL cache:
python scripts/browsecomp_bm25_server.py \
--port 9001 \
--install_bm25_depsThen point MemoryArena at the server:
python run/run_memarena.py \
--config configs/memoryarena_eval.yaml \
--subset progressive_search \
--progressive_search_backend browsecomp_plus \
--browsecomp_plus_retriever_url http://localhost:9001/search \
--progressive_search_eval_mode agentic_searchTo reproduce the older Qdrant embedding retriever path, first build a local BrowseComp-Plus Qdrant index with the same embedding endpoint you will use at query time:
python -m pip install qdrant-client
python scripts/build_browsecomp_qdrant_index.py \
--qdrant-path data/browsecomp_qdrant \
--embed-base-url http://localhost:8001/v1 \
--embed-model Qwen/Qwen3-Embedding-8BThen serve the index:
python scripts/browsecomp_qdrant_server.py \
--qdrant-path data/browsecomp_qdrant \
--embed-base-url http://localhost:8001/v1 \
--embed-model Qwen/Qwen3-Embedding-8B \
--port 8099Run the three-round iterative search diagnostic against that server:
python run/run_memarena.py \
--config configs/memoryarena_eval.yaml \
--subset progressive_search \
--progressive_search_backend browsecomp_plus \
--browsecomp_plus_retriever_url http://localhost:8099/search \
--progressive_search_eval_mode diagnostic_retrieval \
--progressive_search_rounds 3 \
--max_tokens 32768To approximate the paper's BM25 RAG baseline, keep the Progressive Web Search
tool on the embedding retriever and switch only the episode memory retriever to
BM25. The rag baseline disables state-first compilation, but still stores and
retrieves prior subtask traces:
python run/run_memarena.py \
--config configs/memoryarena_eval.yaml \
--subset progressive_search \
--progressive_search_backend browsecomp_plus \
--browsecomp_plus_retriever_url http://localhost:8099/search \
--progressive_search_eval_mode diagnostic_retrieval \
--progressive_search_rounds 3 \
--baseline_mode rag \
--memory_retriever_backend bm25 \
--max_tokens 32768To run the long-context baseline, disable memory retrieval and append every prior subtask trace that fits in the model context window:
python run/run_memarena.py \
--config configs/memoryarena_eval.yaml \
--subset progressive_search \
--progressive_search_backend browsecomp_plus \
--browsecomp_plus_retriever_url http://localhost:8099/search \
--progressive_search_eval_mode diagnostic_retrieval \
--progressive_search_rounds 1 \
--baseline_mode long_context \
--max_tokens 32768To run the Mem0 memory-agent baseline, keep the Progressive Web Search tool on BrowseComp-Plus and point Mem0 at an embedding endpoint. Mem0 defaults to the actor LLM for memory extraction and stores each parallel worker in isolated Qdrant and Mem0-history paths under the run directory:
python run/run_memarena.py \
--config configs/memoryarena_eval.yaml \
--subset progressive_search \
--progressive_search_backend browsecomp_plus \
--browsecomp_plus_retriever_url http://localhost:8099/search \
--progressive_search_eval_mode paper_compatible \
--progressive_search_rounds 1 \
--baseline_mode mem0 \
--mem0_embedding_base_url http://localhost:8001/v1 \
--mem0_embedding_api_key EMPTY \
--mem0_embedding_model Qwen/Qwen3-Embedding-8B \
--mem0_vector_dimension 4096 \
--max_tokens 32768Pass --mem0_no_infer only for the raw-trace Mem0 ablation; the baseline keeps
Mem0 fact extraction enabled.
If you already run a BrowseComp-Plus retriever service, generate the same cache without local indexes:
python scripts/prepare_browsecomp_plus_cache.py \
--backend http \
--retriever_url http://localhost:9000/searchpython run/run_memarena.py --help
python run/run_memarena.py \
--config configs/memoryarena_eval.yaml \
--subset progressive_search \
--num_tasks 5For progressive_search, the default remains offline QA. To run against
BrowseComp-Plus evidence without starting a separate retriever server, provide
the embedding endpoint:
python run/run_memarena.py \
--config configs/memoryarena_eval.yaml \
--subset progressive_search \
--progressive_search_backend browsecomp_plus \
--browsecomp_plus_embedding_base_url http://localhost:8000/v1progressive_search_eval_mode defaults to paper_compatible: the evaluator
retrieves once with the original subtask query and keeps top_k=5 as the
paper-comparable retrieval policy. Runner-generated query refinement is a
diagnostic mode, not a strict paper reproduction:
python run/run_memarena.py \
--config configs/memoryarena_eval.yaml \
--subset progressive_search \
--progressive_search_backend browsecomp_plus \
--browsecomp_plus_embedding_base_url http://localhost:8000/v1 \
--progressive_search_eval_mode diagnostic_retrieval \
--progressive_search_rounds 5samples.jsonl records search_eval_mode, search_queries,
search_rounds, and search_paper_compatible so retrieval diagnostics do not
get mixed with paper-compatible runs.
You can still provide a retrieval cache or external retriever endpoint:
python run/run_memarena.py \
--config configs/memoryarena_eval.yaml \
--subset progressive_search \
--progressive_search_backend browsecomp_plus \
--browsecomp_plus_cache_path /path/to/browsecomp_plus_results.jsonlThe default config uses an OpenAI-compatible endpoint:
llm:
api_key: "EMPTY"
base_url: "http://localhost:8000/v1"
model: "Qwen3-4B-Instruct-2507"Adjust configs/memoryarena_eval.yaml or create a local config file for your
model endpoint and official benchmark environment paths.
Included:
memoryarena_eval/: standalone evaluator package.run/run_memarena.py: CLI entrypoint.configs/memoryarena_eval*.yaml: eval configs.tests/test_memoryarena.py: MemoryArena regression tests.
Not included:
- Runtime memory-learning or training code.
- Non-MemoryArena benchmark runners.
- Vendored benchmark repositories.
- Paper docs, experiment notes, and result artifacts.