vlmbench

Single-file, drop-in VLM benchmark CLI for your agents.

Benchmark any vision-language model on your own hardware with a single command. vlmbench auto-detects your platform, starts the right backend, and gives you reproducible results as JSON.

  • Ollama on macOS: auto-starts, zero config
  • vLLM on Linux: via Docker (--gpus all, auto-pulls) or native vLLM
  • SGLang on Linux: coming soon

Quick Start

No install needed — just run with uvx:

# Local images/PDFs (macOS Ollama)
uvx vlmbench run -m qwen3-vl:2b -i ./images/

# Linux + vLLM Docker (auto-starts with --gpus all)
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct -i ./images/

# HuggingFace dataset (images)
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct \
  -d hf://vlm-run/FineVision-vlmbench-mini --max-samples 64

# HuggingFace dataset (text-only — use a column as the prompt)
uvx vlmbench run -m meta-llama/Llama-3.1-8B-Instruct \
  -d hf://my-org/my-prompts --dataset-text-col prompt --prompt ""

# Concurrency sweep
uvx vlmbench run -m Qwen/Qwen3-VL-8B-Instruct -i ./images/ \
  --concurrency 4,8,16,32,64

# Use a model profile (custom serve args + setup)
uvx vlmbench run --profile deepseek-ocr -i ./images/

# Cloud / remote API (model auto-detected from server)
uvx vlmbench run -i ./images/ \
  --base-url https://my-server.example.com/v1 --api-key $API_KEY

# Cloud API with explicit model
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct -i ./images/ \
  --base-url https://api.openai.com/v1 --api-key $OPENAI_API_KEY

Or install it: pip install vlmbench

Example Run

uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct \
  -d hf://vlm-run/FineVision-vlmbench-mini --max-samples 64 \
  --prompt "Describe this image in 80 words or less" \
  --concurrency 4,8,16 --backend vllm
╭─ Configuration ──────────────────────────────────────────────────────────────╮
│                                                                              │
│  model        Qwen/Qwen3-VL-2B-Instruct                                      │
│  revision     main                                                           │
│  backend      vLLM 0.11.2                                                    │
│  endpoint     http://localhost:8000/v1                                       │
│                                                                              │
│  gpu          NVIDIA RTX PRO 6000 Blackwell Workstation Edition              │
│  vram         97,887 MiB                                                     │
│  driver       580.126.09                                                     │
│                                                                              │
│  dataset      hf://vlm-run/FineVision-vlmbench-mini                          │
│  images       64 (mixed)                                                     │
│                                                                              │
│  max_tokens   2048                                                           │
│  runs         3                                                              │
│  concurrency  8                                                              │
│                                                                              │
│  monitor      tmux attach -t vlmbench-vllm                                   │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯

╭─ Results ────────────────────────────────────────────────────────────────────╮
│                                                                              │
│  Metric                Value              p50        p95        p99          │
│  Throughput            13.33 img/s         —          —          —           │
│  Tokens/sec            1168 tok/s          —          —          —           │
│  Workers               8                   —          —          —           │
│  TTFT                  58 ms           51 ms     114 ms     140 ms           │
│  TPOT                  5.3 ms         5.0 ms     7.3 ms     7.4 ms           │
│  Latency (per worker)  0.54 s/img     0.46 s     0.92 s     1.36 s           │
│                                                                              │
│  Tokens (avg)          prompt 2,077  •  completion 88                        │
│  Token ranges          prompt 180–8,545  •  completion 55–190                │
│  Images                144  •  avg 964×867 (0.93 MP)                         │
│  Resolution            min 338×266  •  median 1024×768  •  max 2048×1755     │
│  VRAM peak             69.7 GB                                               │
│  Reliability           192/192 ok  •  14.4s total                            │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯
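
As a rough sanity check, the reported metrics can be related to one another. The sketch below assumes the common definitions (TTFT is the time from request start to the first token, TPOT is the mean gap between subsequent tokens); this README does not spell out vlmbench's exact formulas, and the function names are illustrative only.

```python
def stream_metrics(request_start: float, token_times: list[float]) -> tuple[float, float, float]:
    """Return (ttft, tpot, latency) in seconds from token arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    latency = token_times[-1] - request_start
    gaps = len(token_times) - 1
    # TPOT averages the time between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / gaps if gaps else 0.0
    return ttft, tpot, latency


def estimated_latency(ttft_s: float, tpot_s: float, completion_tokens: int) -> float:
    """Approximate latency: the first token, then one TPOT per remaining token."""
    return ttft_s + tpot_s * max(completion_tokens - 1, 0)


# Mean values from the Results table: TTFT 58 ms, TPOT 5.3 ms, 88 completion tokens.
estimate = estimated_latency(0.058, 0.0053, 88)
print(f"estimated latency: {estimate:.2f} s/img")
```

The estimate lands near 0.52 s, in the same range as the 0.54 s/img reported above. Averages of per-request values do not combine exactly, so a small gap is expected.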

Leaderboard

Best peak throughput per model on NVIDIA RTX PRO 6000 Blackwell (vLLM v0.15.1, 39 runs across concurrency sweeps):

| # | Model | Best Tok/s | Workers | TTFT | TPOT |
|---|---|---|---|---|---|
| 1 | lightonai/LightOnOCR-2-1B | 2,439.8 | 32 | 1,439 ms | 22.1 ms |
| 2 | Qwen/Qwen3-VL-2B-Instruct | 2,409.3 | 64 | 440 ms | 14.3 ms |
| 3 | PaddlePaddle/PaddleOCR-VL | 2,341.9 | 64 | 6,385 ms | 49.0 ms |
| 4 | deepseek-ai/DeepSeek-OCR | 1,195.8 | 32 | 3,571 ms | 15.9 ms |
| 5 | Qwen/Qwen3-VL-8B-Instruct | 953.8 | 64 | 448 ms | 25.7 ms |

Compare your own results:

uvx vlmbench compare                       # auto-discovers ~/.vlmbench/benchmarks/
uvx vlmbench compare results/*.json        # or pass files explicitly

See MODELS.md for all tested models and their required --serve-args.

Profiles

Some models need custom Docker images, extra pip installs, or special serve args. Profiles bundle all of this into a single YAML file — just pass --profile and vlmbench handles the rest.

uvx vlmbench profiles                                  # list available profiles
uvx vlmbench run --profile deepseek-ocr -i ./images/   # run with a profile

When you use --profile, it sets --model, --prompt, --serve-args, and (for Docker builds) the base image and setup commands. You can still override any flag explicitly.
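
The override rule can be pictured as a simple merge in which explicitly passed flags win over profile values. This is only a sketch of the precedence described above, not vlmbench's own code; the dictionary keys and placeholder strings are hypothetical, since the profile YAML schema is not shown here.

```python
def apply_profile(profile: dict, cli_flags: dict) -> dict:
    """Merge settings: profile values act as defaults, explicit CLI flags override.

    A CLI value of None means "flag not passed", so the profile value is kept.
    """
    merged = dict(profile)
    for key, value in cli_flags.items():
        if value is not None:
            merged[key] = value
    return merged


# Hypothetical profile contents, for illustration only.
profile = {
    "model": "deepseek-ai/DeepSeek-OCR",
    "prompt": "<profile prompt>",
    "serve_args": "<profile serve args>",
}
cli_flags = {"model": None, "prompt": "Describe this image", "serve_args": None}

settings = apply_profile(profile, cli_flags)
print(settings["model"])   # profile value kept
print(settings["prompt"])  # explicit flag wins
```

Checking against None rather than truthiness matters here: an explicit empty prompt (--prompt "") is still an override.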

| Profile | Model | Base Image | Custom Setup |
|---|---|---|---|
| glm-ocr | zai-org/GLM-OCR | vllm/vllm-openai:nightly | vLLM nightly + transformers >= 5.1.0, MTP speculative decoding |
| deepseek-ocr | deepseek-ai/DeepSeek-OCR | vllm/vllm-openai:v0.15.1 | Custom logits processor, no prefix caching |
| paddleocr-vl | PaddlePaddle/PaddleOCR-VL | vllm/vllm-openai:v0.15.1 | Trust remote code, no prefix caching |
| qwen3-vl-2b | Qwen/Qwen3-VL-2B-Instruct | vllm/vllm-openai:v0.15.1 | |
| qwen3-vl-8b | Qwen/Qwen3-VL-8B-Instruct | vllm/vllm-openai:v0.15.1 | |

Profiles live in vlmbench/profiles/*.yaml and ship with the package. For local Docker workflows:

make build PROFILE=glm-ocr        # generates Dockerfile + docker build
make serve PROFILE=glm-ocr        # start server in tmux
make benchmark PROFILE=glm-ocr    # run benchmark against the server

CLI Reference

| Flag | Default | Description |
|---|---|---|
| --model / -m | auto-detect | Model ID. Auto-detected from server if omitted; required only with --serve. |
| --profile | none | Model profile (e.g. glm-ocr). Sets model, prompt, serve-args. See vlmbench profiles. |
| --input / -i | sample URL | File, directory, or URL (images, PDFs, videos) |
| --dataset / -d | none | HuggingFace dataset (e.g. hf://vlm-run/FineVision-vlmbench-mini) |
| --dataset-image-col | auto-detect | Image column name in HF dataset |
| --dataset-text-col | none | Text column name in HF dataset to use as prompt/document input |
| --dataset-split | train | Dataset split to load |
| --base-url | auto-detect | OpenAI-compatible base URL |
| --api-key | no-key | API key (also reads OPENAI_API_KEY env) |
| --prompt | "Extract all text..." | Prompt/instruction sent with each input. Pass "" to use the text column as the full message. |
| --max-tokens | 2048 | Max completion tokens |
| --runs | 3 | Timed runs per input |
| --warmup | 1 | Warmup runs (not recorded, fail-fast on errors) |
| --concurrency | 8 | Single value or comma-separated sweep (e.g. 4,8,16,32,64) |
| --max-samples | all | Limit number of input samples (useful for dry-runs) |
| --output-directory | ~/.vlmbench/benchmarks/ | Output directory |
| --tag | none | Custom label (used in result filename and metadata) |
| --upload | off | Upload results to HuggingFace (requires HF_TOKEN) |
| --upload-repo | vlm-run/vlmbench-results | HuggingFace dataset repo for uploads |
| --backend | auto | auto, ollama, vllm, vllm-openai:<tag>, sglang:<tag> |
| --serve/--no-serve | --serve | Auto-start server if none detected |
| --serve-args | none | Extra args passed to server |
| --quant | auto | Quantization metadata: fp16, bf16, q4_K_M, etc. |
| --revision | main | Model revision metadata |
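
The --concurrency flag accepts either one value or a comma-separated sweep. A minimal sketch of how such a value could be parsed follows; the function name and the validation rules are assumptions for illustration, not vlmbench's actual parser.

```python
def parse_concurrency(value: str) -> list[int]:
    """Parse "8" or "4,8,16" into a list of positive worker counts."""
    levels = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing comma
        level = int(part)  # raises ValueError on non-numeric input
        if level < 1:
            raise ValueError(f"concurrency must be >= 1, got {level}")
        levels.append(level)
    if not levels:
        raise ValueError("no concurrency levels given")
    return levels


print(parse_concurrency("8"))             # [8]
print(parse_concurrency("4,8,16,32,64"))  # [4, 8, 16, 32, 64]
```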

Backends

| --backend | Resolves to | Serving |
|---|---|---|
| auto | ollama on macOS, vllm-openai:latest on Linux | Native / Docker |
| ollama | Ollama native | ollama serve in tmux |
| vllm | Native vLLM | vllm serve in tmux |
| vllm-openai:latest | vllm/vllm-openai:latest | docker run --gpus all |
| vllm-openai:nightly | vllm/vllm-openai:nightly | docker run --gpus all |
| sglang:latest | lmsysorg/sglang:latest | docker run --gpus all (coming soon) |

All Docker backends run with --gpus all --ipc=host and a deterministic container name for easy log access.

Input Types

| Type | Source | Processing |
|---|---|---|
| Image | --input (.png, .jpg, .jpeg, .webp, .tiff, .bmp) | Base64 encode |
| PDF | --input (.pdf) | pypdfium2 per-page → base64 |
| Video | --input (.mp4, .mov, .avi, .mkv, .webm) | ffmpeg 1fps → frames → base64 |
| HF image dataset | --dataset hf://... | Auto-detect image column, base64 encode |
| HF text dataset | --dataset hf://... --dataset-text-col <col> | Each row's value sent as a text content block |

Directories are processed recursively, sorted alphabetically.
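
For the image case, the processing amounts to walking the directory, sorting the paths, and base64-encoding each file. The following sketch shows one way to do that and wrap the result in an OpenAI-style image_url content block; the helper names are illustrative, and the data-URL layout is an assumption about what an OpenAI-compatible server typically accepts rather than a description of vlmbench internals.

```python
import base64
import mimetypes
from pathlib import Path

# Image extensions listed in the Input Types table.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".tiff", ".bmp"}


def find_images(root: str) -> list[Path]:
    """Collect image files under root recursively, sorted by path."""
    base = Path(root)
    if base.is_file():
        return [base] if base.suffix.lower() in IMAGE_EXTS else []
    return sorted(
        p for p in base.rglob("*") if p.is_file() and p.suffix.lower() in IMAGE_EXTS
    )


def image_content_block(path: Path) -> dict:
    """Encode one image as a base64 data URL inside an image_url content block."""
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    data = base64.b64encode(path.read_bytes()).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{data}"}}
```

A caller would typically combine one such block with a text block holding the prompt in a single user message.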

Text-only benchmarks

For LLM (non-vision) benchmarks, use an HF dataset with a text column:

# Each row's "prompt" column is the full message (--prompt "" = no instruction appended)
uvx vlmbench run -m meta-llama/Llama-3.1-8B-Instruct \
  -d hf://my-org/my-prompts --dataset-text-col prompt --prompt ""

# Each row's "text" column is the document; --prompt is the instruction
uvx vlmbench run -m meta-llama/Llama-3.1-8B-Instruct \
  -d hf://my-org/my-docs --dataset-text-col text \
  --prompt "Summarize the above in one sentence."

Auto-detection falls back to text columns (named text, prompt, input, content, query, question, instruction) when no image column is found.

Output

Results are saved as JSON to ~/.vlmbench/benchmarks/ with model metadata, environment info, benchmark stats (TTFT, TPOT, throughput, latency percentiles), and raw per-run data. Each concurrency level produces a separate file.
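
Since the exact JSON schema is not documented in this README, a cautious first step is to list which top-level keys each result file contains before writing any analysis code. The sketch below does only that; the function name is illustrative, and it makes no assumption about field names.

```python
import json
from pathlib import Path


def list_result_keys(directory: str = "~/.vlmbench/benchmarks/") -> dict[str, list[str]]:
    """Map each result file name to its sorted top-level JSON keys."""
    base = Path(directory).expanduser()
    if not base.is_dir():
        return {}
    summary = {}
    for path in sorted(base.glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        summary[path.name] = sorted(data) if isinstance(data, dict) else []
    return summary


if __name__ == "__main__":
    for name, keys in list_result_keys().items():
        print(name, keys)
```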

Upload results to HuggingFace with --upload:

uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct -d hf://vlm-run/FineVision-vlmbench-mini \
  --concurrency 4,8,16,32,64 --upload

Browse uploaded results at vlm-run/vlmbench-results.

How It Works

When you run vlmbench run, here's what happens:

  1. Detects your platform: macOS routes to Ollama, Linux to vLLM Docker
  2. Pulls the Docker image: docker pull vllm/vllm-openai:latest (cached after first run)
  3. Starts the server in tmux: docker run --gpus all in a named session (vlmbench-vllm)
  4. Launches a GPU monitor: nvitop (Linux) or macmon (macOS) in a split pane
  5. Waits for the server: polls /v1/models until ready (up to 600s)
  6. Runs warmup requests: fail-fast validation before timed runs
  7. Benchmarks with concurrency: streams completions via the OpenAI API, measures TTFT/TPOT/throughput
  8. Saves results as JSON: one file per concurrency level in ~/.vlmbench/benchmarks/
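
The readiness wait in step 5 can be sketched as a polling loop against the OpenAI-compatible models endpoint. This is an illustration under assumptions, not vlmbench's implementation: the function names, the 2-second interval, and the injectable fetch, clock, and sleep parameters are all hypothetical; only the /v1/models path and the 600-second limit come from the steps above.

```python
import json
import time
import urllib.request


def fetch_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """GET {base_url}/models and return the advertised model IDs."""
    url = f"{base_url.rstrip('/')}/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return [item["id"] for item in payload.get("data", [])]


def wait_for_server(base_url, max_wait=600.0, interval=2.0,
                    fetch=fetch_models, clock=time.monotonic, sleep=time.sleep):
    """Poll until the server lists at least one model or max_wait seconds pass."""
    deadline = clock() + max_wait
    while True:
        try:
            models = fetch(base_url)
            if models:
                return models
        except (OSError, ValueError):
            pass  # not up yet: connection refused, timeout, or invalid JSON
        if clock() >= deadline:
            raise TimeoutError(f"{base_url} not ready after {max_wait:.0f}s")
        sleep(interval)
```

For a local server this would be called as wait_for_server("http://localhost:8000/v1"). The returned model IDs are also what makes auto-detecting the model from a running server possible.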

Attach to the live session anytime: tmux attach -t vlmbench-vllm

tmux session capture — server logs + GPU monitor side by side

Top pane — vLLM server logs:

(APIServer pid=1) INFO 02-07 15:44:24 non-default args: {
  'model': 'lightonai/LightOnOCR-2-1B',
  'enable_prefix_caching': False,
  'limit_mm_per_prompt': {'image': 1},
  'mm_processor_cache_gb': 0.0
}
(APIServer pid=1) INFO 02-07 15:44:34 Resolved architecture: LightOnOCRForConditionalGeneration
(APIServer pid=1) INFO 02-07 15:44:34 Using max model len 16384
(EngineCore pid=272) INFO 02-07 15:44:44 Initializing a V1 LLM engine (v0.15.1) with config:
  model='lightonai/LightOnOCR-2-1B', dtype=torch.bfloat16, max_seq_len=16384,
  tensor_parallel_size=1, quantization=None
(EngineCore pid=272) INFO 02-07 15:45:41 Loading weights took 0.49 seconds
(EngineCore pid=272) INFO 02-07 15:45:42 Model loading took 1.88 GiB memory and 22.15 seconds
(EngineCore pid=272) INFO 02-07 15:46:11 Available KV cache memory: 77.94 GiB
(EngineCore pid=272) INFO 02-07 15:46:11 Maximum concurrency for 16,384 tokens per request: 44.53x
Capturing CUDA graphs (decode, FULL): 100% |██████████| 51/51
(APIServer pid=1) INFO Started server process [1]
(APIServer pid=1) INFO Application startup complete.
(APIServer pid=1) INFO 172.17.0.1 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Bottom pane — nvitop GPU monitor:

NVITOP 1.6.2      Driver Version: 580.126.09      CUDA Driver Version: 13.0
╒═══════════════════════════════╤══════════════════════╤══════════════════════╕
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ Volatile Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╡
│   0  GeForce RTX 2080 Ti  Off │ 00000000:21:00.0 Off │                  N/A │
│ 27%   42C   P8     17W / 250W │  107.2MiB / 11264MiB │      0%      Default │
├───────────────────────────────┼──────────────────────┼──────────────────────┤
│   1  RTX PRO 6000         Off │ 00000000:4B:00.0 Off │                  N/A │
│ 30%   33C   P1     66W / 600W │  86.54GiB / 95.59GiB │      0%      Default │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╛
  MEM: ███████████████████████████████████████████████████████████▏ 90.5%
  Load Average: 4.14  2.73  1.65

Claude Code Installation

Install vlmbench as a Claude Code plugin:

# 1. Register the marketplace
/plugin marketplace add vlm-run/vlmbench

# 2. Install the skill
/plugin install vlmbench@vlm-run/vlmbench

After restarting Claude Code, the vlmbench skill will be available. Mention it directly in your instructions to benchmark models, compare results, or debug server issues.

Requirements

  • Python >= 3.11, uv recommended
  • Linux: Docker + NVIDIA GPU support (or native vLLM via uv pip install vllm)
  • Monitoring: tmux, nvitop (Linux) or macmon (macOS)
  • Optional: ffmpeg (video frame extraction)
