Some MCP servers in our original tool pool have been deprecated or taken offline, which makes a subset of questions in the original dataset no longer executable/reproducible. To address this, we release an updated version of the dataset: please refer to `mcpverse/data/mcpverse_time_invariant_v1.1.csv`. Note that this version only includes time-invariant questions.
MCPVerse is a comprehensive benchmark built on a large-scale set of executable, real-world tools. With three evaluation modes, it tests LLMs from using a minimal, per-question toolset to mounting 550+ tools at once—approaching an OS-like environment. MCPVerse thus provides a realistic, execution-grounded benchmark of current LLM agentic capabilities.
The evaluation system is built on top of CAMEL; thanks to the CAMEL team for their excellent work.
- Clone the repository

  ```bash
  git clone https://github.com/hailsham/mcpverse
  cd mcpverse
  ```

- Use uv to install the dependencies

  ```bash
  pip install uv
  uv venv .venv --python=3.10
  source .venv/bin/activate
  uv pip install -e ".[all, dev]"
  ```

The entire MCP pool is defined in `mcpverse/tool_full.json`.
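To get a quick sense of what the pool contains, you can inspect the file directly. The snippet below is only a sketch: it assumes nothing about the schema beyond the file being valid JSON, so adjust the access to match the actual top-level structure.

```python
import json

# Count the top-level entries in the MCP pool definition.
# The exact schema is whatever tool_full.json contains; this sketch only
# assumes the file is valid JSON.
with open("mcpverse/tool_full.json") as f:
    pool = json.load(f)

entries = pool if isinstance(pool, list) else list(pool)
print(f"{len(entries)} top-level entries in mcpverse/tool_full.json")
```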
Some MCP services also require API key registration; the links to these APIs are listed below. After obtaining your API keys, edit `.env` to include them:
```bash
SMITHERY_API_KEY="YOUR_API_KEY"
AMAP_API_KEY="YOUR_API_KEY"
ALPHAVANTAGE_API_KEY="YOUR_API_KEY"
RIJKSMUSEUM_API_KEY="YOUR_API_KEY"
NASA_API_KEY="YOUR_API_KEY"
VARFLIGHT_API_KEY="YOUR_API_KEY"
```

Also edit `.env` to include your LLM API keys:
```bash
QWEN_API_BASE_URL="YOUR_API_BASE_URL"
QWEN_API_KEY="YOUR_API_KEY"
ANTHROPIC_API_BASE_URL="YOUR_API_BASE_URL"
ANTHROPIC_API_KEY="YOUR_API_KEY"
```
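After filling in `.env`, a quick way to confirm the keys are visible to Python is sketched below. It uses python-dotenv purely for illustration; it is not necessarily how the repository itself loads the variables.

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Load .env into the process environment and spot-check a few of the keys
# the benchmark expects. Extend the list as needed.
load_dotenv()

for key in ["SMITHERY_API_KEY", "NASA_API_KEY", "QWEN_API_KEY", "ANTHROPIC_API_KEY"]:
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```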
- Prepare test data

  ```bash
  cd mcpverse
  chmod +x test_data/git/generate_repo.sh
  ./test_data/git/generate_repo.sh
  ```
- Get reference answers for time-sensitive tasks (skip this step if using the time-invariant dataset)

  ```bash
  python runner.py --mode get_ref \
    --dataset_path data/mcpverse_time_sensitive.csv \
    --inout_path results/input_with_ref.csv
  ```
- Quick test (debug mode)

  ```bash
  python runner.py --mode debug --model_name deepseek-v3.2
  ```
- Inference

  Run Oracle mode with Function Calling (FC):

  ```bash
  python runner.py \
    --mode infer \
    --infer_mode oracle \
    --fc_mode FC \
    --model_name deepseek-v3.2 \
    --judge_model Qwen25-72B \
    --dataset_path data/mcpverse_time_invariant_v1.1.csv \
    --inout_folder deepseek_v32_oracle
  ```

  To run Standard mode with FC, change `--infer_mode` to `standard` (`--infer_mode standard`).
  To run Standard mode with prompt-based tool use, change `--fc_mode` to `Prompt` (`--fc_mode Prompt`).

  To run a single case, add `--test_id Q101`. A fully assembled example is shown after this list.
- Evaluate

  Configure the judge model in `judger.py`, then run:

  ```bash
  python runner.py \
    --mode eval \
    --infer_mode oracle \
    --fc_mode FC \
    --model_name deepseek-v3.2 \
    --judge_model Qwen25-72B \
    --dataset_path data/mcpverse_time_invariant_v1.1.csv \
    --inout_folder deepseek_v32_oracle
  ```
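For example, combining the options above, Standard mode with prompt-based tool use on a single question could be launched as follows (the `--inout_folder` name here is just an illustrative choice):

```bash
python runner.py \
  --mode infer \
  --infer_mode standard \
  --fc_mode Prompt \
  --model_name deepseek-v3.2 \
  --judge_model Qwen25-72B \
  --dataset_path data/mcpverse_time_invariant_v1.1.csv \
  --inout_folder deepseek_v32_standard_prompt \
  --test_id Q101
```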
Tips:
- Both the inference and evaluation stages support automatic resumption from the inout file.
- All outputs are saved in the `outputs/${inout_folder}/` directory.
To add a new model, edit `MCPAgentRunner::_init_model()` in `mcpverse/mcp_agent_runner.py`:

```python
self.model = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI_COMPATIBLE_MODEL,
    model_type="Model_Name",
    url="Model_Endpoint",
    api_key="Your_Key",
)
```
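For instance, instead of hardcoding credentials, the same registration can read them from `.env` via the environment. This is only a sketch: the `MY_MODEL_*` variable names and the model name are illustrative placeholders, not defined by the repository.

```python
import os

# Sketch: same ModelFactory.create call as above, but with the endpoint and
# key pulled from environment variables (MY_MODEL_API_BASE_URL and
# MY_MODEL_API_KEY are placeholder names, not repository-defined ones).
self.model = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI_COMPATIBLE_MODEL,
    model_type="my-model-name",
    url=os.environ["MY_MODEL_API_BASE_URL"],
    api_key=os.environ["MY_MODEL_API_KEY"],
)
```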
The default judge model is Qwen2.5-72B; you can change it by editing `mcpverse/judger.py`.
If you find this work useful, please cite our paper:
```bibtex
@misc{lei2025mcpverse,
  title={MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use},
  author={Fei Lei and Yibo Yang and Wenxiu Sun and Dahua Lin},
  year={2025},
  eprint={2508.16260},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.16260},
}
```