Skip to content

Latest commit

 

History

History
238 lines (180 loc) · 7.36 KB

File metadata and controls

238 lines (180 loc) · 7.36 KB

Installation and Docker Task Usage Guideline

Overview

The MCPMark setup supports installation through either pip or MCPMark Docker (recommended) after cloning the code repository.

Pip Installtion

pip install -e .

The MCPMark Docker setup provides a simple way to run evaluation tasks in isolated containers. PostgreSQL is automatically handled when needed.

1. Quick Start

1.1 Docker Image

The official Docker image is automatically pulled from Docker Hub on first use. The image is hosted at: https://hub.docker.com/r/evalsysorg/mcpmark

Image Management:

  • The scripts automatically download the image when it's not found locally
  • To manually update to the latest version:
    docker pull evalsysorg/mcpmark:latest
  • For local development/testing, you can build your own docker:
     # Creates evalsysorg/mcpmark:latest locally
    ./build-docker.sh

2. Running MCP Experiments

2.1 Running Individual MCP Experiment

The run-task.sh script provides simplified Docker usage:

# Run filesystem tasks (filesystem is the default mcp service)
./run-task.sh --models MODEL_NAME --k K

# Run github/notion/postgres/playwright/playwright_webarena with specific task
./run-task.sh --mcp MCPSERVICE --models MODEL_NAME --exp-name EXPNAME --tasks TASK --k K

where MODEL_NAME refers to the model choice from the supported models (see Introduction Page for more information), EXPNAME refers to customized experiment name, TASK refers to specific task or task group (see tasks/<mcp>/<task_suite>/... for more information), K refers to the time of independent experiments.

Additionally, the run-benchmark.sh script evaluates models across all MCP services:

# Run all services with Docker (recommended)
./run-benchmark.sh --models MODEL --exp-name EXPNAME --docker

# Run specific services
./run-benchmark.sh --models MODEL --exp-name EXPNAME --mcps MCPSERVICES --docker

# Run with parallel execution for faster results
./run-benchmark.sh --models MODEL --exp-name EXPNAME --docker --parallel

# Run locally without Docker
./run-benchmark.sh --models MODEL --exp-name EXPNAME --mcps MCPSERVICES

Here MCPSERVICES refers to group of MCP services, separated by comma (e.g. filesystem,postgres)

The benchmark script:

  • Runs all or selected MCP services automatically
  • Supports progress tracking and timing
  • Generates summary reports and logs
  • Supports parallel service execution
  • Continues running even if some services fail
  • Automatically generates performance dashboards

Manual Docker Commands

For Non-Postgres Services

Suppose Notion is the service:

# Build the image first
./build-docker.sh

# Run a task
docker run --rm \
  -v $(pwd)/results:/app/results \
  -v $(pwd)/.mcp_env:/app/.mcp_env:ro \
  -v $(pwd)/notion_state.json:/app/notion_state.json:ro \
  evalsysorg/mcpmark:latest \
  python3 -m pipeline --mcp notion --models MODEL --exp-name EXPNAME --tasks TASK --k K

For Postgres Service

# The run-task.sh script handles postgres automatically, but if doing manually:

# Start postgres container
docker run -d \
  --name mcp-postgres \
  --network mcp-network \
  -e POSTGRES_DATABASE=postgres \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=123456 \
  ghcr.io/cloudnative-pg/postgresql:17-bookworm

# Run postgres task
docker run --rm \
  --network mcp-network \
  -e POSTGRES_HOST=mcp-postgres \
  -v $(pwd)/results:/app/results \
  -v $(pwd)/.mcp_env:/app/.mcp_env:ro \
  evalsysorg/mcpmark:latest \
  python3 -m pipeline --mcp postgres --models MODEL --exp-name EXPNAME --tasks TASK --k K

# Stop and remove postgres when done
docker stop mcp-postgres && docker rm mcp-postgres

Script Usage

Benchmark Runner (run-benchmark.sh)

./run-benchmark.sh --models MODELS --exp-name NAME [OPTIONS]

Required Options:
    --models MODELS      Comma-separated list of models to evaluate
    --exp-name NAME     Experiment name for organizing results

Optional Options:
    --docker            Run tasks in Docker containers (recommended)
    --mcps SERVICES Comma-separated list of services to test
                        Default: filesystem,notion,github,postgres,playwright
    --parallel          Run services in parallel (experimental)
    --timeout SECONDS   Timeout per task in seconds (default: 300)

Individual Task Runner (run-task.sh)

./run-task.sh [--mcp SERVICE] [PIPELINE_ARGS]

Options:
    --mcp SERVICE    MCP service (notion|github|filesystem|playwright|postgres)
                        Default: filesystem

Environment Variables:
    DOCKER_MEMORY_LIMIT  Memory limit for container (default: 4g)
    DOCKER_CPU_LIMIT     CPU limit for container (default: 2)
    DOCKER_IMAGE_VERSION Docker image tag to use (default: latest)

All other arguments are passed directly to the pipeline command.

Pipeline arguments (see python3 -m pipeline --help):
  --mcp {notion,github,filesystem,playwright,postgres,playwright_webarena}
                        MCP service to use (default: filesystem)
  --models MODELS       Comma-separated list of models to evaluate (e.g., 'o3,k2,gpt-4.1')
  --tasks TASKS         Tasks to run: "all", a category name, or "category/task_name"
  --exp-name EXP_NAME   Experiment name; results are saved under results/<exp-name>/ (default: YYYY-MM-DD-HH-MM-SS)
  --k K                 Number of evaluation runs for pass@k metrics (default: 1)
  --timeout TIMEOUT     Timeout in seconds for each task
  --output-dir OUTPUT_DIR
                        Directory to save results

Docker Benefits

  1. Efficiency: Only starts necessary containers
  2. Isolation: Each task runs in a fresh container
  3. Resource Management: Automatic cleanup of containers and networks
  4. Smart Dependencies: PostgreSQL only starts for postgres service
  5. Parallel Support: Can run multiple services simultaneously for faster benchmarks
  6. Comprehensive Testing: Benchmark script runs all services with one command
  7. Progress Tracking: Colored output with timing and status information
  8. Automatic Reporting: Generates summary reports and performance dashboards

Common Troubleshooting

Permission Issues

chmod +x run-task.sh

Docker Build Issues

# Force rebuild with no cache
./run-task.sh --build --mcp MCPSERVICE --models MODEL_NAME --exp-name EXPNAME --tasks TASK

PostgreSQL Connection Issues

# Check if postgres is running
docker ps | grep postgres

# View postgres logs
docker logs mcp-postgres-task

Cleanup Stuck Resources

# Stop all containers
docker stop $(docker ps -q)

# Remove task network
docker network rm mcp-task-network

# Remove postgres data volume (careful!)
docker volume rm mcp-postgres-data

Environment Variables

Create .mcp_env file with your credentials:

# Service credentials
SOURCE_NOTION_API_KEY=your-key
EVAL_NOTION_API_KEY=your-key
GITHUB_TOKEN=your-token
POSTGRES_PASSWORD=your-password

# Model API keys
OPENAI_API_KEY=your-key
ANTHROPIC_API_KEY=your-key
# ... etc

Please refer to Quick Start for setting up API key for specific model.

Docker Compose Files

  • docker-compose.yml - Full stack with postgres (for development/testing)

Notes

  • Results are saved under ./results/<exp-name>/.
  • Each task runs in an ephemeral container.
  • Docker image is shared across all tasks.
  • PostgreSQL data persists in Docker volume.