After cloning the code repository, MCPMark can be set up either with pip or with MCPMark Docker (recommended).
pip install -e .
The MCPMark Docker setup provides a simple way to run evaluation tasks in isolated containers. PostgreSQL is automatically handled when needed.
The official Docker image is automatically pulled from Docker Hub on first use. The image is hosted at: https://hub.docker.com/r/evalsysorg/mcpmark
Image Management:
- The scripts automatically download the image when it's not found locally
- To manually update to the latest version:
docker pull evalsysorg/mcpmark:latest
- For local development/testing, you can build the image yourself:
# Creates evalsysorg/mcpmark:latest locally
./build-docker.sh
The run-task.sh script provides simplified Docker usage:
# Run filesystem tasks (filesystem is the default MCP service)
./run-task.sh --models MODEL_NAME --k K
# Run github/notion/postgres/playwright/playwright_webarena with a specific task
./run-task.sh --mcp MCPSERVICE --models MODEL_NAME --exp-name EXPNAME --tasks TASK --k K
where MODEL_NAME is one of the supported models (see the Introduction page for more information), EXPNAME is a custom experiment name, TASK is a specific task or task group (see tasks/<mcp>/<task_suite>/... for more information), and K is the number of independent runs.
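The TASK selector accepts everything ("all"), a whole task group, or a single category/task_name pair. A minimal sketch of how such a selector can be resolved (the helper below is illustrative, not part of MCPMark):

```python
def resolve_task_selector(selector: str):
    """Split a --tasks selector into (category, task_name).

    Illustrative helper: "all" selects everything, "category" selects a
    whole task group, and "category/task_name" selects a single task.
    """
    if selector == "all":
        return (None, None)          # run every category and task
    parts = selector.split("/")
    if len(parts) == 1:
        return (parts[0], None)      # a whole task group
    if len(parts) == 2:
        return (parts[0], parts[1])  # one task inside a group
    raise ValueError(f"unrecognized task selector: {selector!r}")
```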
Additionally, the run-benchmark.sh script evaluates models across all MCP services:
# Run all services with Docker (recommended)
./run-benchmark.sh --models MODEL --exp-name EXPNAME --docker
# Run specific services
./run-benchmark.sh --models MODEL --exp-name EXPNAME --mcps MCPSERVICES --docker
# Run with parallel execution for faster results
./run-benchmark.sh --models MODEL --exp-name EXPNAME --docker --parallel
# Run locally without Docker
./run-benchmark.sh --models MODEL --exp-name EXPNAME --mcps MCPSERVICES
Here MCPSERVICES is a comma-separated list of MCP services (e.g. filesystem,postgres).
The benchmark script:
- Runs all or selected MCP services automatically
- Supports progress tracking and timing
- Generates summary reports and logs
- Supports parallel service execution
- Continues running even if some services fail
- Automatically generates performance dashboards
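The continue-on-failure behaviour can be pictured as a small driver loop. The sketch below is illustrative only (MCPMark's actual driver is a shell script, and the helper name here is hypothetical): each service gets its own pipeline invocation, and a failing service is recorded rather than aborting the whole run.

```python
import subprocess

def run_services(commands):
    """Run one command per MCP service, continuing when a service fails.

    `commands` maps each service name to an argv list; the result maps
    each service to "ok" or "failed" based on the exit code.
    """
    statuses = {}
    for service, argv in commands.items():
        result = subprocess.run(argv)  # blocks until the service's run ends
        statuses[service] = "ok" if result.returncode == 0 else "failed"
    return statuses
```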
For example, with Notion as the service:
# Build the image first
./build-docker.sh
# Run a task
docker run --rm \
-v $(pwd)/results:/app/results \
-v $(pwd)/.mcp_env:/app/.mcp_env:ro \
-v $(pwd)/notion_state.json:/app/notion_state.json:ro \
evalsysorg/mcpmark:latest \
python3 -m pipeline --mcp notion --models MODEL --exp-name EXPNAME --tasks TASK --k K
The run-task.sh script handles PostgreSQL automatically, but if doing it manually:
# Start postgres container
docker run -d \
--name mcp-postgres \
--network mcp-network \
-e POSTGRES_DATABASE=postgres \
-e POSTGRES_USER=postgres \
-e POSTGRES_PASSWORD=123456 \
ghcr.io/cloudnative-pg/postgresql:17-bookworm
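After starting the container, PostgreSQL may need a moment before it accepts connections. A generic TCP readiness probe (not part of MCPMark) can bridge the gap before launching the task container:

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds.

    Returns True as soon as the port accepts a connection, or False if
    `timeout` seconds elapse first. Works for PostgreSQL or any TCP service.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # not ready yet; retry shortly
    return False
```

For the manual postgres setup above, `wait_for_port("localhost", 5432)` (or the mapped host port) would gate the `docker run` for the task.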
# Run postgres task
docker run --rm \
--network mcp-network \
-e POSTGRES_HOST=mcp-postgres \
-v $(pwd)/results:/app/results \
-v $(pwd)/.mcp_env:/app/.mcp_env:ro \
evalsysorg/mcpmark:latest \
python3 -m pipeline --mcp postgres --models MODEL --exp-name EXPNAME --tasks TASK --k K
# Stop and remove postgres when done
docker stop mcp-postgres && docker rm mcp-postgres
Usage:
./run-benchmark.sh --models MODELS --exp-name NAME [OPTIONS]
Required Options:
--models MODELS Comma-separated list of models to evaluate
--exp-name NAME Experiment name for organizing results
Optional:
--docker Run tasks in Docker containers (recommended)
--mcps SERVICES Comma-separated list of services to test
Default: filesystem,notion,github,postgres,playwright
--parallel Run services in parallel (experimental)
--timeout SECONDS Timeout per task in seconds (default: 300)
./run-task.sh [--mcp SERVICE] [PIPELINE_ARGS]
Options:
--mcp SERVICE MCP service (notion|github|filesystem|playwright|postgres)
Default: filesystem
Environment Variables:
DOCKER_MEMORY_LIMIT Memory limit for container (default: 4g)
DOCKER_CPU_LIMIT CPU limit for container (default: 2)
DOCKER_IMAGE_VERSION Docker image tag to use (default: latest)
All other arguments are passed directly to the pipeline command.
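The resource-limit environment variables above follow a read-with-default pattern; a minimal sketch (the helper name is hypothetical, not part of MCPMark):

```python
import os

def docker_limits() -> dict:
    """Read the container limits the wrapper scripts honour, falling
    back to the documented defaults when a variable is unset."""
    return {
        "memory": os.environ.get("DOCKER_MEMORY_LIMIT", "4g"),
        "cpus": os.environ.get("DOCKER_CPU_LIMIT", "2"),
        "image_tag": os.environ.get("DOCKER_IMAGE_VERSION", "latest"),
    }
```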
Pipeline arguments (see python3 -m pipeline --help):
--mcp {notion,github,filesystem,playwright,postgres,playwright_webarena}
MCP service to use (default: filesystem)
--models MODELS Comma-separated list of models to evaluate (e.g., 'o3,k2,gpt-4.1')
--tasks TASKS Tasks to run: "all", a category name, or "category/task_name"
--exp-name EXP_NAME Experiment name; results are saved under results/<exp-name>/ (default: YYYY-MM-DD-HH-MM-SS)
--k K Number of evaluation runs for pass@k metrics (default: 1)
--timeout TIMEOUT Timeout in seconds for each task
--output-dir OUTPUT_DIR
Directory to save results
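The --k independent runs feed pass@k metrics. The standard unbiased estimator of pass@k from n runs with c successes (shown here as a general sketch; MCPMark's exact implementation may differ) is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n runs (c successful) passes.
    """
    if n - c < k:
        return 1.0  # too few failures: every size-k subset has a success
    return 1.0 - comb(n - c, k) / comb(n, k)
```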
- Efficiency: Only starts necessary containers
- Isolation: Each task runs in a fresh container
- Resource Management: Automatic cleanup of containers and networks
- Smart Dependencies: PostgreSQL only starts for postgres service
- Parallel Support: Can run multiple services simultaneously for faster benchmarks
- Comprehensive Testing: Benchmark script runs all services with one command
- Progress Tracking: Colored output with timing and status information
- Automatic Reporting: Generates summary reports and performance dashboards
chmod +x run-task.sh
# Force rebuild with no cache
./run-task.sh --build --mcp MCPSERVICE --models MODEL_NAME --exp-name EXPNAME --tasks TASK
# Check if postgres is running
docker ps | grep postgres
# View postgres logs
docker logs mcp-postgres-task
# Stop all containers
docker stop $(docker ps -q)
# Remove task network
docker network rm mcp-task-network
# Remove postgres data volume (careful!)
docker volume rm mcp-postgres-data
Create a .mcp_env file with your credentials:
# Service credentials
SOURCE_NOTION_API_KEY=your-key
EVAL_NOTION_API_KEY=your-key
GITHUB_TOKEN=your-token
POSTGRES_PASSWORD=your-password
# Model API keys
OPENAI_API_KEY=your-key
ANTHROPIC_API_KEY=your-key
# ... etc
Please refer to Quick Start for setting up the API key for a specific model.
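The .mcp_env file follows the usual KEY=value dotenv convention; a minimal parser sketch (MCPMark itself may load the file differently):

```python
def parse_env_file(text: str) -> dict:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore comments and blanks
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env
```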
docker-compose.yml - Full stack with postgres (for development/testing)
- Results are saved under ./results/<exp-name>/.
- Each task runs in an ephemeral container.
- Docker image is shared across all tasks.
- PostgreSQL data persists in Docker volume.