Benchmark any vision-language model on your own hardware with a single command. vlmbench auto-detects your platform, starts the right backend, and gives you reproducible results as JSON.
- Ollama on macOS: auto-starts, zero config
- vLLM on Linux: via Docker (`--gpus all`, auto-pulls) or native vLLM
- SGLang on Linux: coming soon
No install needed — just run with uvx:
```shell
# Local images/PDFs (macOS Ollama)
uvx vlmbench run -m qwen3-vl:2b -i ./images/

# Linux + vLLM Docker (auto-starts with --gpus all)
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct -i ./images/

# HuggingFace dataset (images)
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct \
  -d hf://vlm-run/FineVision-vlmbench-mini --max-samples 64

# HuggingFace dataset (text-only — use a column as the prompt)
uvx vlmbench run -m meta-llama/Llama-3.1-8B-Instruct \
  -d hf://my-org/my-prompts --dataset-text-col prompt --prompt ""

# Concurrency sweep
uvx vlmbench run -m Qwen/Qwen3-VL-8B-Instruct -i ./images/ \
  --concurrency 4,8,16,32,64

# Use a model profile (custom serve args + setup)
uvx vlmbench run --profile deepseek-ocr -i ./images/

# Cloud / remote API (model auto-detected from server)
uvx vlmbench run -i ./images/ \
  --base-url https://my-server.example.com/v1 --api-key $API_KEY

# Cloud API with explicit model
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct -i ./images/ \
  --base-url https://api.openai.com/v1 --api-key $OPENAI_API_KEY
```

Or install it: `pip install vlmbench`
```shell
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct \
  -d hf://vlm-run/FineVision-vlmbench-mini --max-samples 64 \
  --prompt "Describe this image in 80 words or less" \
  --concurrency 4,8,16 --backend vllm
```

╭─ Configuration ──────────────────────────────────────────────────────────────╮
│ │
│ model Qwen/Qwen3-VL-2B-Instruct │
│ revision main │
│ backend vLLM 0.11.2 │
│ endpoint http://localhost:8000/v1 │
│ │
│ gpu NVIDIA RTX PRO 6000 Blackwell Workstation Edition │
│ vram 97,887 MiB │
│ driver 580.126.09 │
│ │
│ dataset hf://vlm-run/FineVision-vlmbench-mini │
│ images 64 (mixed) │
│ │
│ max_tokens 2048 │
│ runs 3 │
│ concurrency 8 │
│ │
│ monitor tmux attach -t vlmbench-vllm │
│ │
╰──────────────────────────────────────────────────────────────────────────────╯
╭─ Results ────────────────────────────────────────────────────────────────────╮
│ │
│ Metric Value p50 p95 p99 │
│ Throughput 13.33 img/s — — — │
│ Tokens/sec 1168 tok/s — — — │
│ Workers 8 — — — │
│ TTFT 58 ms 51 ms 114 ms 140 ms │
│ TPOT 5.3 ms 5.0 ms 7.3 ms 7.4 ms │
│ Latency (per worker) 0.54 s/img 0.46 s 0.92 s 1.36 s │
│ │
│ Tokens (avg) prompt 2,077 • completion 88 │
│ Token ranges prompt 180–8,545 • completion 55–190 │
│ Images 144 • avg 964×867 (0.93 MP) │
│ Resolution min 338×266 • median 1024×768 • max 2048×1755 │
│ VRAM peak 69.7 GB │
│ Reliability 192/192 ok • 14.4s total │
│ │
╰──────────────────────────────────────────────────────────────────────────────╯
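The p50/p95/p99 columns above are percentiles over per-request timings. A minimal sketch of how such percentiles can be computed from raw latencies, using nearest-rank selection (illustrative only; vlmbench's actual interpolation may differ):

```python
# Nearest-rank percentile over per-request latencies (illustrative sketch,
# not vlmbench's actual code).
def percentile(values: list[float], p: float) -> float:
    """Return the nearest-rank percentile, p in [0, 100]."""
    s = sorted(values)
    return s[round(p / 100 * (len(s) - 1))]

# Hypothetical per-image latencies (seconds) from one concurrency level.
latencies_s = [0.41, 0.44, 0.46, 0.48, 0.52, 0.61, 0.92, 1.36]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_s, p):.2f} s")
```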
Best peak throughput per model on NVIDIA RTX PRO 6000 Blackwell (vLLM v0.15.1, 39 runs across concurrency sweeps):
| # | Model | Best Tok/s | Workers | TTFT | TPOT |
|---|---|---|---|---|---|
| 1 | lightonai/LightOnOCR-2-1B | 2,439.8 | 32 | 1,439 ms | 22.1 ms |
| 2 | Qwen/Qwen3-VL-2B-Instruct | 2,409.3 | 64 | 440 ms | 14.3 ms |
| 3 | PaddlePaddle/PaddleOCR-VL | 2,341.9 | 64 | 6,385 ms | 49.0 ms |
| 4 | deepseek-ai/DeepSeek-OCR | 1,195.8 | 32 | 3,571 ms | 15.9 ms |
| 5 | Qwen/Qwen3-VL-8B-Instruct | 953.8 | 64 | 448 ms | 25.7 ms |
Compare your own results:

```shell
uvx vlmbench compare                  # auto-discovers ~/.vlmbench/benchmarks/
uvx vlmbench compare results/*.json   # or pass files explicitly
```

See MODELS.md for all tested models and their required `--serve-args`.
Some models need custom Docker images, extra pip installs, or special serve args. Profiles bundle all of this into a single YAML file — just pass --profile and vlmbench handles the rest.
```shell
uvx vlmbench profiles                                  # list available profiles
uvx vlmbench run --profile deepseek-ocr -i ./images/   # run with a profile
```

When you use `--profile`, it sets `--model`, `--prompt`, `--serve-args`, and (for Docker builds) the base image and setup commands. You can still override any flag explicitly.
| Profile | Model | Base Image | Custom Setup |
|---|---|---|---|
| glm-ocr | zai-org/GLM-OCR | vllm/vllm-openai:nightly | vLLM nightly + transformers >= 5.1.0, MTP speculative decoding |
| deepseek-ocr | deepseek-ai/DeepSeek-OCR | vllm/vllm-openai:v0.15.1 | Custom logits processor, no prefix caching |
| paddleocr-vl | PaddlePaddle/PaddleOCR-VL | vllm/vllm-openai:v0.15.1 | Trust remote code, no prefix caching |
| qwen3-vl-2b | Qwen/Qwen3-VL-2B-Instruct | vllm/vllm-openai:v0.15.1 | — |
| qwen3-vl-8b | Qwen/Qwen3-VL-8B-Instruct | vllm/vllm-openai:v0.15.1 | — |
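As a rough idea of what a profile bundles, a hypothetical sketch of one entry (field names here are assumptions for illustration — the shipped files in vlmbench/profiles/ define the actual schema):

```yaml
# Hypothetical profile sketch; field names are assumed, not the real schema.
name: deepseek-ocr
model: deepseek-ai/DeepSeek-OCR
base_image: vllm/vllm-openai:v0.15.1
serve_args:
  - --no-enable-prefix-caching   # "no prefix caching" per the table above
```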
Profiles live in vlmbench/profiles/*.yaml and ship with the package. For local Docker workflows:
```shell
make build PROFILE=glm-ocr       # generates Dockerfile + docker build
make serve PROFILE=glm-ocr       # start server in tmux
make benchmark PROFILE=glm-ocr   # run benchmark against the server
```

| Flag | Default | Description |
|---|---|---|
| `--model` / `-m` | auto-detect | Model ID. Auto-detected from server if omitted; required only with `--serve`. |
| `--profile` | none | Model profile (e.g. `glm-ocr`). Sets model, prompt, serve-args. See `vlmbench profiles`. |
| `--input` / `-i` | sample URL | File, directory, or URL (images, PDFs, videos) |
| `--dataset` / `-d` | none | HuggingFace dataset (e.g. `hf://vlm-run/FineVision-vlmbench-mini`) |
| `--dataset-image-col` | auto-detect | Image column name in HF dataset |
| `--dataset-text-col` | none | Text column name in HF dataset to use as prompt/document input |
| `--dataset-split` | `train` | Dataset split to load |
| `--base-url` | auto-detect | OpenAI-compatible base URL |
| `--api-key` | `no-key` | API key (also reads `OPENAI_API_KEY` env) |
| `--prompt` | `"Extract all text..."` | Prompt/instruction sent with each input. Pass `""` to use the text column as the full message. |
| `--max-tokens` | `2048` | Max completion tokens |
| `--runs` | `3` | Timed runs per input |
| `--warmup` | `1` | Warmup runs (not recorded, fail-fast on errors) |
| `--concurrency` | `8` | Single value or comma-separated sweep (e.g. `4,8,16,32,64`) |
| `--max-samples` | all | Limit number of input samples (useful for dry-runs) |
| `--output-directory` | `~/.vlmbench/benchmarks/` | Output directory |
| `--tag` | none | Custom label (used in result filename and metadata) |
| `--upload` | off | Upload results to HuggingFace (requires `HF_TOKEN`) |
| `--upload-repo` | `vlm-run/vlmbench-results` | HuggingFace dataset repo for uploads |
| `--backend` | `auto` | `auto`, `ollama`, `vllm`, `vllm-openai:<tag>`, `sglang:<tag>` |
| `--serve/--no-serve` | `--serve` | Auto-start server if none detected |
| `--serve-args` | none | Extra args passed to server |
| `--quant` | `auto` | Quantization metadata: `fp16`, `bf16`, `q4_K_M`, etc. |
| `--revision` | `main` | Model revision metadata |
| `--backend` | Resolves to | Serving |
|---|---|---|
| `auto` | `ollama` on macOS, `vllm-openai:latest` on Linux | Native / Docker |
| `ollama` | Ollama native | `ollama serve` in tmux |
| `vllm` | Native vLLM | `vllm serve` in tmux |
| `vllm-openai:latest` | `vllm/vllm-openai:latest` | `docker run --gpus all` |
| `vllm-openai:nightly` | `vllm/vllm-openai:nightly` | `docker run --gpus all` |
| `sglang:latest` | `lmsysorg/sglang:latest` | `docker run --gpus all` (coming soon) |
All Docker backends run with `--gpus all --ipc=host` and a deterministic container name for easy log access.
| Type | Source | Processing |
|---|---|---|
| Image | `--input` (.png, .jpg, .jpeg, .webp, .tiff, .bmp) | Base64 encode |
| PDF | `--input` (.pdf) | pypdfium2 per-page → base64 |
| Video | `--input` (.mp4, .mov, .avi, .mkv, .webm) | ffmpeg 1fps → frames → base64 |
| HF image dataset | `--dataset hf://...` | Auto-detect image column, base64 encode |
| HF text dataset | `--dataset hf://... --dataset-text-col <col>` | Each row's value sent as a text content block |
Directories are processed recursively, sorted alphabetically.
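The "base64 encode" step above follows the usual OpenAI-compatible pattern of embedding the image as a data URL inside a chat content block. A minimal sketch (illustrative, not vlmbench's actual code; the helper names are made up):

```python
# Sketch: package a local image as an OpenAI-style image content block.
import base64
from pathlib import Path

def data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Base64-encode raw image bytes into a data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

def image_content_block(path: str) -> dict:
    """Build the content-part dict sent in a chat completion request."""
    p = Path(path)
    mime = f"image/{p.suffix.lstrip('.').lower() or 'png'}"
    return {"type": "image_url", "image_url": {"url": data_url(p.read_bytes(), mime)}}
```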
For LLM (non-vision) benchmarks, use an HF dataset with a text column:
```shell
# Each row's "prompt" column is the full message (--prompt "" = no instruction appended)
uvx vlmbench run -m meta-llama/Llama-3.1-8B-Instruct \
  -d hf://my-org/my-prompts --dataset-text-col prompt --prompt ""

# Each row's "text" column is the document; --prompt is the instruction
uvx vlmbench run -m meta-llama/Llama-3.1-8B-Instruct \
  -d hf://my-org/my-docs --dataset-text-col text \
  --prompt "Summarize the above in one sentence."
```

Auto-detection falls back to text columns (named `text`, `prompt`, `input`, `content`, `query`, `question`, `instruction`) when no image column is found.
Results are saved as JSON to ~/.vlmbench/benchmarks/ with model metadata, environment info, benchmark stats (TTFT, TPOT, throughput, latency percentiles), and raw per-run data. Each concurrency level produces a separate file.
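For orientation, a result file is roughly shaped like the following (field names here are illustrative assumptions, not the exact schema; the numbers echo the sample run above):

```json
{
  "model": "Qwen/Qwen3-VL-2B-Instruct",
  "backend": "vllm",
  "concurrency": 8,
  "stats": {
    "ttft_ms": {"mean": 58, "p50": 51, "p95": 114, "p99": 140},
    "tpot_ms": {"mean": 5.3, "p50": 5.0, "p95": 7.3, "p99": 7.4},
    "throughput_img_per_s": 13.33
  }
}
```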
Upload results to HuggingFace with `--upload`:

```shell
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct -d hf://vlm-run/FineVision-vlmbench-mini \
  --concurrency 4,8,16,32,64 --upload
```

Browse uploaded results at vlm-run/vlmbench-results.
When you run `vlmbench run`, here's what happens:
- Detects your platform — macOS routes to Ollama, Linux to vLLM Docker
- Pulls the Docker image — `docker pull vllm/vllm-openai:latest` (cached after first run)
- Starts the server in tmux — `docker run --gpus all` in a named session (`vlmbench-vllm`)
- Launches a GPU monitor — `nvitop` (Linux) or `macmon` (macOS) in a split pane
- Waits for the server — polls `/v1/models` until ready (up to 600s)
- Runs warmup requests — fail-fast validation before timed runs
- Benchmarks with concurrency — streams completions via the OpenAI API, measures TTFT/TPOT/throughput
- Saves results as JSON — one file per concurrency level in `~/.vlmbench/benchmarks/`
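Streaming is what makes TTFT and TPOT observable: the first chunk's arrival gives TTFT, and the spacing of later chunks gives TPOT. A minimal sketch of that arithmetic, assuming you record an arrival timestamp and token count per streamed chunk (illustrative only, not vlmbench's actual measurement code):

```python
# Derive TTFT/TPOT from a streamed completion (illustrative sketch).
# `events` is a list of (arrival_time_seconds, tokens_in_chunk) pairs.
def stream_metrics(start: float, events: list[tuple[float, int]]) -> dict:
    first_arrival = events[0][0]
    last_arrival = events[-1][0]
    total_tokens = sum(n for _, n in events)
    ttft = first_arrival - start
    # TPOT: average inter-token gap after the first token arrives.
    tpot = (last_arrival - first_arrival) / max(1, total_tokens - 1)
    return {"ttft_s": ttft, "tpot_s": tpot, "tokens": total_tokens}
```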
Attach to the live session anytime: `tmux attach -t vlmbench-vllm`
tmux session capture — server logs + GPU monitor side by side
Top pane — vLLM server logs:
(APIServer pid=1) INFO 02-07 15:44:24 non-default args: {
'model': 'lightonai/LightOnOCR-2-1B',
'enable_prefix_caching': False,
'limit_mm_per_prompt': {'image': 1},
'mm_processor_cache_gb': 0.0
}
(APIServer pid=1) INFO 02-07 15:44:34 Resolved architecture: LightOnOCRForConditionalGeneration
(APIServer pid=1) INFO 02-07 15:44:34 Using max model len 16384
(EngineCore pid=272) INFO 02-07 15:44:44 Initializing a V1 LLM engine (v0.15.1) with config:
model='lightonai/LightOnOCR-2-1B', dtype=torch.bfloat16, max_seq_len=16384,
tensor_parallel_size=1, quantization=None
(EngineCore pid=272) INFO 02-07 15:45:41 Loading weights took 0.49 seconds
(EngineCore pid=272) INFO 02-07 15:45:42 Model loading took 1.88 GiB memory and 22.15 seconds
(EngineCore pid=272) INFO 02-07 15:46:11 Available KV cache memory: 77.94 GiB
(EngineCore pid=272) INFO 02-07 15:46:11 Maximum concurrency for 16,384 tokens per request: 44.53x
Capturing CUDA graphs (decode, FULL): 100% |██████████| 51/51
(APIServer pid=1) INFO Started server process [1]
(APIServer pid=1) INFO Application startup complete.
(APIServer pid=1) INFO 172.17.0.1 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Bottom pane — nvitop GPU monitor:
NVITOP 1.6.2 Driver Version: 580.126.09 CUDA Driver Version: 13.0
╒═══════════════════════════════╤══════════════════════╤══════════════════════╕
│ GPU Name Persistence-M│ Bus-Id Disp.A │ Volatile Uncorr. ECC │
│ Fan Temp Perf Pwr:Usage/Cap│ Memory-Usage │ GPU-Util Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╡
│ 0 GeForce RTX 2080 Ti Off │ 00000000:21:00.0 Off │ N/A │
│ 27% 42C P8 17W / 250W │ 107.2MiB / 11264MiB │ 0% Default │
├───────────────────────────────┼──────────────────────┼──────────────────────┤
│ 1 RTX PRO 6000 Off │ 00000000:4B:00.0 Off │ N/A │
│ 30% 33C P1 66W / 600W │ 86.54GiB / 95.59GiB │ 0% Default │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╛
MEM: ███████████████████████████████████████████████████████████▏ 90.5%
Load Average: 4.14 2.73 1.65
Install vlmbench as a Claude Code plugin:
```
# 1. Register the marketplace
/plugin marketplace add vlm-run/vlmbench

# 2. Install the skill
/plugin install vlmbench@vlm-run/vlmbench
```

After restarting Claude Code, the vlmbench skill will be available. Mention it directly in your instructions to benchmark models, compare results, or debug server issues.
- Python >= 3.11, `uv` recommended
- Linux: Docker + NVIDIA GPU support (or native vLLM via `uv pip install vllm`)
- Monitoring: `tmux`, `nvitop` (Linux) or `macmon` (macOS)
- Optional: `ffmpeg` (video frame extraction)
