
MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use


Updates

2025.12.22

Some MCP servers in our original tool pool have been deprecated or taken offline, which makes a subset of questions in the original dataset no longer executable/reproducible. To address this, we release an updated version of the dataset. Please refer to mcpverse/data/mcpverse_time_invariant_v1.1.csv. Note that this version only includes time-invariant questions.
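
For a quick look at the updated dataset, here is a minimal sketch (pandas is an assumption for convenience, not a stated project dependency; nothing below relies on specific column names):

import pandas as pd

# Peek at the time-invariant dataset; the schema is whatever the CSV
# defines, so we only inspect it rather than assume field names.
df = pd.read_csv("mcpverse/data/mcpverse_time_invariant_v1.1.csv")
print(df.shape)
print(df.columns.tolist())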

Overview

MCPVerse is a comprehensive benchmark built on a large-scale set of executable, real-world tools. With three evaluation modes, it tests LLMs from using a minimal, per-question toolset to mounting 550+ tools at once, approaching an OS-like environment. MCPVerse thus provides a realistic, execution-grounded benchmark of current LLM agentic capabilities.

The evaluation system is built on top of CAMEL; thanks to their excellent work.

Installation

  1. Clone the repository
git clone https://github.com/hailsham/mcpverse
cd mcpverse
  2. Use uv to install the dependencies
pip install uv
uv venv .venv --python=3.10
source .venv/bin/activate

uv pip install -e ".[all, dev]"
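
As a quick sanity check of the install, a minimal sketch (it assumes the ".[all, dev]" extras pull in the camel-ai package that the evaluation system builds on):

# Sanity check: the CAMEL framework should be importable after installation.
import camel
print(camel.__version__)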

Quick Start

Set Up API Keys

MCP Service API Keys (Required)

The full MCP tool pool is defined in mcpverse/tool_full.json. Some of these MCP services require API key registration with the corresponding providers.

After obtaining your API keys, edit .env to include them:

SMITHERY_API_KEY = "YOUR_API_KEY"
AMAP_API_KEY = "YOUR_API_KEY"
ALPHAVANTAGE_API_KEY = "YOUR_API_KEY"
RIJKSMUSEUM_API_KEY = "YOUR_API_KEY"
NASA_API_KEY = "YOUR_API_KEY"
VARFLIGHT_API_KEY = "YOUR_API_KEY"

LLM API Keys

Similarly, edit .env to include your LLM API keys and endpoints:

QWEN_API_BASE_URL="YOUR_API_BASE_URL"
QWEN_API_KEY="YOUR_API_KEY"
ANTHROPIC_API_BASE_URL="YOUR_API_BASE_URL"
ANTHROPIC_API_KEY="YOUR_API_KEY"
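
Before running anything, a small sketch to verify the keys are actually set (python-dotenv is an assumption here; the variable list mirrors the entries above, so extend it as needed):

import os
from dotenv import load_dotenv  # assumption: python-dotenv is available

# Variables mirroring the .env entries above; extend with ANTHROPIC_* etc.
REQUIRED = [
    "SMITHERY_API_KEY", "AMAP_API_KEY", "ALPHAVANTAGE_API_KEY",
    "RIJKSMUSEUM_API_KEY", "NASA_API_KEY", "VARFLIGHT_API_KEY",
    "QWEN_API_BASE_URL", "QWEN_API_KEY",
]

load_dotenv()  # reads .env from the current directory
missing = [k for k in REQUIRED if not os.getenv(k) or "YOUR_" in os.environ[k]]
if missing:
    raise SystemExit(f"Unset or placeholder keys: {missing}")
print("All required keys look set.")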

Run Evaluation

Preparation

  1. Prepare test data

    cd mcpverse
    chmod +x test_data/git/generate_repo.sh
    ./test_data/git/generate_repo.sh
  2. Get reference answers for time-sensitive tasks (skip this step if using the time-invariant dataset)

    python runner.py --mode get_ref \
        --dataset_path data/mcpverse_time_sensitive.csv \
        --inout_path results/input_with_ref.csv

Running

  1. Quick test (debug mode)

    python runner.py --mode debug --model_name deepseek-v3.2 
  2. Inference

    Run Oracle mode with Function Calling (FC):

    python runner.py \
        --mode infer \
        --infer_mode oracle \
        --fc_mode FC \
        --model_name deepseek-v3.2 \
        --judge_model Qwen25-72B \
        --dataset_path data/mcpverse_time_invariant_v1.1.csv \
        --inout_folder deepseek_v32_oracle

    Run Standard mode with FC by changing --infer_mode to standard:

    --infer_mode standard

    Run Standard mode with Prompt-based tool use by changing --fc_mode to Prompt:

    --fc_mode Prompt

    To run a single case, add --test_id Q101:

    --test_id Q101    
    
  3. Evaluate

    Configure the judge model in judger.py, then run:

    python runner.py \
        --mode eval \
        --infer_mode oracle \
        --fc_mode FC \
        --model_name deepseek-v3.2 \
        --judge_model Qwen25-72B \
        --dataset_path data/mcpverse_time_invariant_v1.1.csv \
        --inout_folder deepseek_v32_oracle

Tips:

  1. Both inference and evaluation stages support automatic resumption from the inout file.
  2. All outputs are saved in the outputs/${inout_folder}/ directory.
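
To script both stages end to end, a minimal driver sketch (every flag is taken verbatim from the commands above; subprocess is just one convenient way to chain them):

import subprocess

# Shared flags, copied from the commands above.
common = [
    "--infer_mode", "oracle",
    "--fc_mode", "FC",
    "--model_name", "deepseek-v3.2",
    "--judge_model", "Qwen25-72B",
    "--dataset_path", "data/mcpverse_time_invariant_v1.1.csv",
    "--inout_folder", "deepseek_v32_oracle",
]

# Run inference, then evaluation. Because both stages resume from the
# inout file, re-running this script continues where it left off.
for mode in ("infer", "eval"):
    subprocess.run(["python", "runner.py", "--mode", mode, *common], check=True)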

Add New Model

Add a new model in MCPAgentRunner::_init_model() in mcpverse/mcp_agent_runner.py:

self.model = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI_COMPATIBLE_MODEL,
    model_type="Model_Name",   # model identifier your endpoint expects
    url="Model_Endpoint",      # base URL of the OpenAI-compatible API
    api_key="Your_Key",
)
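
In practice you may prefer to read the endpoint and key from .env instead of hardcoding them; a sketch (reusing the QWEN_* variables set earlier purely for illustration):

import os

self.model = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI_COMPATIBLE_MODEL,
    model_type="Model_Name",
    url=os.environ["QWEN_API_BASE_URL"],  # endpoint from .env
    api_key=os.environ["QWEN_API_KEY"],   # key from .env
)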

Change Score Model

The default score model is Qwen2.5-72B; you can change it by editing mcpverse/judger.py.

Citation

If you find this work useful, please cite our paper:

@misc{lei2025mcpverse,
    title={MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use}, 
    author={Fei Lei and Yibo Yang and Wenxiu Sun and Dahua Lin},
    year={2025},
    eprint={2508.16260},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2508.16260}, 
}
