Benchmarking the agent in real-world tasks within a large-scale MCP toolset.
🌐 Website | 📄 Paper | 🤗 Dataset | 🐳 Docker | 🏆 Leaderboard | 🙏 Citation
- [8/18/2025] We release Docker images and add evaluation results to the leaderboard for three new models: GLM 4.5, GPT-5-Mini, and Kimi-K2.
- [8/3/2025] We release LiveMCPBench.
We recommend using our docker image, but if you want to run the code locally, you will need to install the following tools:
- npm
- uv
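For a local run, a quick way to confirm that both prerequisites are on the PATH is sketched below. The helper name missing_tools is hypothetical and not part of the repository; it only relies on the standard command -v builtin.

```shell
# Print the name of every listed command that is not found on PATH.
# Returns 0 when all commands are available, 1 otherwise.
# NOTE: missing_tools is an illustrative helper, not a script shipped with the repo.
missing_tools() {
  local tool
  local status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# Check the two prerequisites listed above.
missing_tools npm uv || echo "install the tools printed above before running locally" >&2
```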
- Pull the docker image
docker pull hysdhlx/livemcpbench:latest
- Clone the repo and run the docker image
git clone https://github.com/icip-cas/LiveMCPBench.git
cd LiveMCPBench
docker run -itd \
    -v "$(pwd):/outside" \
    --gpus all \
    --ipc=host \
    --net=host \
    --name LiveMCPBench_container \
    hysdhlx/livemcpbench:latest \
    bash
- Prepare the .env file
cp .env_template .env
You can modify the .env file to set your own environment variables.
# MCP Copilot Agent Configuration
BASE_URL=
OPENAI_API_KEY=
MODEL=

# Tool Retrieval Configuration
EMBEDDING_MODEL=
EMBEDDING_BASE_URL=
EMBEDDING_API_KEY=
EMBEDDING_DIMENSIONS=1024
TOP_SERVERS=5
TOP_TOOLS=3

# Abstract API Configuration (optional)
ABSTRACT_MODEL=
ABSTRACT_API_KEY=
ABSTRACT_BASE_URL=

# Proxy Configuration (optional)
http_proxy=
https_proxy=
no_proxy=127.0.0.1,localhost
HTTP_PROXY=
HTTPS_PROXY=
NO_PROXY=127.0.0.1,localhost

# lark report (optional)
LARK_WEBHOOK_URL=
- Enter the container and reset the environment
Because the code repo is mounted at /outside, you can access it inside the container at /outside/.
docker exec -it LiveMCPBench_container bash
Because the agent may change the environment, we recommend resetting the environment before running the agent. To reset it, run the following commands:
cd /LiveMCPBench/
bash scripts/env_reset.sh
This will copy the repo code in /outside to /LiveMCPBench and link the annotated_data to /root/.
- Check the MCP tools
bash ./tools/scripts/tool_check.sh
After running this command, you can check ./tools/test/tools.json to see the tools.
You can run this script multiple times if you find that some tools are not working.
- Index the servers
The MCP Copilot Agent requires the servers to be indexed before it runs. Run the following command to warm up the agent:
uv run -m baseline.mcp_copilot.arg_generation
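The tool check step above notes that the script may need to be run several times when some tools are not working. A minimal sketch of automating such reruns follows. The retry helper is hypothetical, and it assumes that the wrapped command exits with a non-zero status on failure, which this README does not state for tool_check.sh; if the script always exits 0, inspect ./tools/test/tools.json by hand instead.

```shell
# Run a command up to MAX_ATTEMPTS times, stopping at the first success.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# NOTE: retry is an illustrative helper, not part of the repository.
retry() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (assumes the script signals failure through its exit status):
#   retry 3 10 bash ./tools/scripts/tool_check.sh
```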
bash ./baseline/scripts/run_example.sh
This will run the agent with a simple example and save the results in ./baseline/output/.
By default, the data that the agent accesses is stored under /root. If you run locally, make sure the files are in the right path.
- Run the MCP Copilot Agent
Be sure you have set the environment variables in the .env file.
bash ./baseline/scripts/run_baselines.sh
- Check the results
After running the agent, you can check the trajectories in ./baseline/output.
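The run step above asks you to make sure the environment variables in .env are set. One way to check this before launching the agent is sketched below. The helper empty_env_keys is hypothetical, and the list of required keys in the example is an assumption: it simply takes every key that the template does not mark as optional.

```shell
# Print every KEY from the argument list that is missing or empty in ENV_FILE.
# Returns 0 when all keys have a value, 1 otherwise.
# Usage: empty_env_keys ENV_FILE KEY [KEY...]
# NOTE: empty_env_keys is an illustrative helper, not part of the repository.
empty_env_keys() {
  local env_file="$1"
  shift
  local key
  local value
  local status=0
  for key in "$@"; do
    # Strip the "KEY=" prefix from lines that assign KEY; keep the last one.
    value=$(sed -n "s/^${key}=//p" "$env_file" | tail -n 1)
    if [ -z "$value" ]; then
      echo "$key"
      status=1
    fi
  done
  return "$status"
}

# Example (key list assumed from the non-optional entries of the template):
#   empty_env_keys .env BASE_URL OPENAI_API_KEY MODEL \
#       EMBEDDING_MODEL EMBEDDING_BASE_URL EMBEDDING_API_KEY
```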
- Modify MODEL in .env to change the evaluation model
- Run the evaluation script
bash ./evaluator/scripts/run_baseline.sh
- Check the results
After running the evaluation, you can check the results in ./evaluator/output.
- Calculate the success rate
uv run ./evaluator/stat_success_rate.py --result_path /path/to/evaluation/
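Since the evaluation model is selected through MODEL in .env, switching between several evaluation models can be scripted. The sketch below uses a hypothetical helper, set_env_var, that rewrites one assignment in an env file. It moves the key to the end of the file, so the comment headers from the template no longer sit directly above it.

```shell
# Set KEY=VALUE in ENV_FILE, replacing any existing assignment to KEY.
# Usage: set_env_var ENV_FILE KEY VALUE
# NOTE: set_env_var is an illustrative helper, not part of the repository.
set_env_var() {
  local env_file="$1"
  local key="$2"
  local value="$3"
  local tmp
  tmp=$(mktemp)
  # Copy every line except existing assignments to KEY. grep exits 1 when
  # nothing is left to copy, which is not an error here.
  grep -v "^${key}=" "$env_file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back in place so the file keeps its original permissions.
  cat "$tmp" > "$env_file"
  rm -f "$tmp"
}

# Example (model names are placeholders):
#   for model in model-a model-b; do
#     set_env_var .env MODEL "$model"
#     bash ./evaluator/scripts/run_baseline.sh
#   done
```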
LiveMCPBench/
├── annotated_data/ # Tasks and task files
├── baseline/ # MCP Copilot Agent
│ ├── scripts/ # Scripts for running the agent
│ ├── output/ # Output for the agent
│ └── mcp_copilot/ # Source code for the agent
├── evaluator/ # LiveMCPEval
│ ├── scripts/ # Scripts for evaluation
│ └── output/ # Output for evaluation
├── tools/ # LiveMCPTool
│ ├── LiveMCPTool/ # Tool data
│ └── scripts/ # Scripts for the tools
├── scripts/ # Path prepare scripts
├── utils/ # Utility functions
└── .env_template # Template for environment
If you find this project helpful, please use the following to cite it:
@misc{mo2025livemcpbenchagentsnavigateocean,
title={LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?},
author={Guozhao Mo and Wenliang Zhong and Jiawei Chen and Xuanang Chen and Yaojie Lu and Hongyu Lin and Ben He and Xianpei Han and Le Sun},
year={2025},
eprint={2508.01780},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2508.01780},
}