Some MCP servers in our original tool pool have been deprecated or taken offline, which makes a subset of questions in the original dataset no longer executable/reproducible. To address this, we release an updated version of the dataset: please refer to `mcpverse/data/mcpverse_time_invariant_v1.1.csv`. Note that this version only includes time-invariant questions.
MCPVerse is a comprehensive benchmark built on a large-scale set of executable, real-world tools. With three evaluation modes, it tests LLMs from using a minimal, per-question toolset to mounting 550+ tools at once—approaching an OS-like environment. MCPVerse thus provides a realistic, execution-grounded benchmark of current LLM agentic capabilities.
The evaluation system is built on top of CAMEL; thanks to the CAMEL team for their excellent work.
- Clone the repository

  ```bash
  git clone https://github.com/hailsham/mcpverse
  cd mcpverse
  ```

- Use uv to install the dependencies

  ```bash
  pip install uv
  uv venv .venv --python=3.10
  source .venv/bin/activate
  uv pip install -e ".[all, dev]"
  ```

The entire MCP pool is defined in `mcpverse/tool_full.json`.
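To get a quick sense of what the pool contains, you can inspect the file directly. The snippet below is only a sketch: it assumes nothing about the schema beyond the file being valid JSON, so adjust the access to match the actual top-level structure.

```python
import json

# Count the top-level entries in the MCP pool definition.
# The exact schema is whatever tool_full.json contains; this sketch only
# assumes the file is valid JSON.
with open("mcpverse/tool_full.json") as f:
    pool = json.load(f)

entries = pool if isinstance(pool, list) else list(pool)
print(f"{len(entries)} top-level entries in mcpverse/tool_full.json")
```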
Some MCP services also require API key registration; the links to these APIs are listed below. After obtaining your API keys, edit `.env` to include them:
```bash
SMITHERY_API_KEY="YOUR_API_KEY"
AMAP_API_KEY="YOUR_API_KEY"
ALPHAVANTAGE_API_KEY="YOUR_API_KEY"
RIJKSMUSEUM_API_KEY="YOUR_API_KEY"
NASA_API_KEY="YOUR_API_KEY"
VARFLIGHT_API_KEY="YOUR_API_KEY"
```

Also edit `.env` to include your LLM API keys:
```bash
QWEN_API_BASE_URL="YOUR_API_BASE_URL"
QWEN_API_KEY="YOUR_API_KEY"
ANTHROPIC_API_BASE_URL="YOUR_API_BASE_URL"
ANTHROPIC_API_KEY="YOUR_API_KEY"
```
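After filling in `.env`, a quick way to confirm the keys are visible to Python is sketched below. It uses python-dotenv purely for illustration; it is not necessarily how the repository itself loads the variables.

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Load .env into the process environment and spot-check a few of the keys
# the benchmark expects. Extend the list as needed.
load_dotenv()

for key in ["SMITHERY_API_KEY", "NASA_API_KEY", "QWEN_API_KEY", "ANTHROPIC_API_KEY"]:
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```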
- Prepare test data

  ```bash
  cd mcpverse
  chmod +x test_data/git/generate_repo.sh
  ./test_data/git/generate_repo.sh
  ```
- Get reference answers for time-sensitive tasks (skip this step if using the time-invariant dataset)

  ```bash
  python runner.py --mode get_ref \
    --dataset_path data/mcpverse_time_sensitive.csv \
    --inout_path results/input_with_ref.csv
  ```
- Quick test (debug mode)

  ```bash
  python runner.py --mode debug --model_name deepseek-v3.2
  ```
- Inference

  Run Oracle mode with Function Calling (FC):

  ```bash
  python runner.py \
    --mode infer \
    --infer_mode oracle \
    --fc_mode FC \
    --model_name deepseek-v3.2 \
    --judge_model Qwen25-72B \
    --dataset_path data/mcpverse_time_invariant_v1.1.csv \
    --inout_folder deepseek_v32_oracle
  ```

  To run Standard mode with FC, change `--infer_mode` to `standard` (`--infer_mode standard`).
  To run Standard mode with prompt-based tool use, change `--fc_mode` to `Prompt` (`--fc_mode Prompt`).

  To run a single case, add `--test_id Q101`. A fully assembled example is shown after this list.
- Evaluate

  Configure the judge model in `judger.py`, then run:

  ```bash
  python runner.py \
    --mode eval \
    --infer_mode oracle \
    --fc_mode FC \
    --model_name deepseek-v3.2 \
    --judge_model Qwen25-72B \
    --dataset_path data/mcpverse_time_invariant_v1.1.csv \
    --inout_folder deepseek_v32_oracle
  ```
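For example, combining the options above, Standard mode with prompt-based tool use on a single question could be launched as follows (the `--inout_folder` name here is just an illustrative choice):

```bash
python runner.py \
  --mode infer \
  --infer_mode standard \
  --fc_mode Prompt \
  --model_name deepseek-v3.2 \
  --judge_model Qwen25-72B \
  --dataset_path data/mcpverse_time_invariant_v1.1.csv \
  --inout_folder deepseek_v32_standard_prompt \
  --test_id Q101
```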
Tips:
- Both the inference and evaluation stages support automatic resumption from the inout file.
- All outputs are saved in the `outputs/${inout_folder}/` directory.
To add a new model, edit `MCPAgentRunner::_init_model()` in `mcpverse/mcp_agent_runner.py`:

```python
self.model = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI_COMPATIBLE_MODEL,
    model_type="Model_Name",
    url="Model_Endpoint",
    api_key="Your_Key",
)
```
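For instance, instead of hardcoding credentials, the same registration can read them from `.env` via the environment. This is only a sketch: the `MY_MODEL_*` variable names and the model name are illustrative placeholders, not defined by the repository.

```python
import os

# Sketch: same ModelFactory.create call as above, but with the endpoint and
# key pulled from environment variables (MY_MODEL_API_BASE_URL and
# MY_MODEL_API_KEY are placeholder names, not repository-defined ones).
self.model = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI_COMPATIBLE_MODEL,
    model_type="my-model-name",
    url=os.environ["MY_MODEL_API_BASE_URL"],
    api_key=os.environ["MY_MODEL_API_KEY"],
)
```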
The default judge model is Qwen2.5-72B; you can change it by editing `mcpverse/judger.py`.
If you find this work useful, please cite our paper:
```bibtex
@misc{lei2025mcpverse,
  title={MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use},
  author={Fei Lei and Yibo Yang and Wenxiu Sun and Dahua Lin},
  year={2025},
  eprint={2508.16260},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.16260},
}
```