Benchmark any vision-language model on your own hardware with a single command. vlmbench auto-detects your platform, starts the right backend, and gives you reproducible results as JSON.
- Ollama on macOS: auto-starts, zero config
- vLLM on Linux: via Docker (`--gpus all`, auto-pulls) or native vLLM
- SGLang on Linux: coming soon
No install needed — just run with uvx:
```shell
# Local images/PDFs (macOS Ollama)
uvx vlmbench run -m qwen3-vl:2b -i ./images/

# Linux + vLLM Docker (auto-starts with --gpus all)
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct -i ./images/

# HuggingFace dataset (images)
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct \
  -d hf://vlm-run/FineVision-vlmbench-mini --max-samples 64

# HuggingFace dataset (text-only — use a column as the prompt)
uvx vlmbench run -m meta-llama/Llama-3.1-8B-Instruct \
  -d hf://my-org/my-prompts --dataset-text-col prompt --prompt ""

# Concurrency sweep
uvx vlmbench run -m Qwen/Qwen3-VL-8B-Instruct -i ./images/ \
  --concurrency 4,8,16,32,64

# Use a model profile (custom serve args + setup)
uvx vlmbench run --profile deepseek-ocr -i ./images/

# Cloud / remote API (model auto-detected from server)
uvx vlmbench run -i ./images/ \
  --base-url https://my-server.example.com/v1 --api-key $API_KEY

# Cloud API with explicit model
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct -i ./images/ \
  --base-url https://api.openai.com/v1 --api-key $OPENAI_API_KEY
```

Or install it: `pip install vlmbench`
```shell
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct \
  -d hf://vlm-run/FineVision-vlmbench-mini --max-samples 64 \
  --prompt "Describe this image in 80 words or less" \
  --concurrency 4,8,16 --backend vllm
```

╭─ Configuration ──────────────────────────────────────────────────────────────╮
│ │
│ model Qwen/Qwen3-VL-2B-Instruct │
│ revision main │
│ backend vLLM 0.11.2 │
│ endpoint http://localhost:8000/v1 │
│ │
│ gpu NVIDIA RTX PRO 6000 Blackwell Workstation Edition │
│ vram 97,887 MiB │
│ driver 580.126.09 │
│ │
│ dataset hf://vlm-run/FineVision-vlmbench-mini │
│ images 64 (mixed) │
│ │
│ max_tokens 2048 │
│ runs 3 │
│ concurrency 8 │
│ │
│ monitor tmux attach -t vlmbench-vllm │
│ │
╰──────────────────────────────────────────────────────────────────────────────╯
╭─ Results ────────────────────────────────────────────────────────────────────╮
│ │
│ Metric Value p50 p95 p99 │
│ Throughput 13.33 img/s — — — │
│ Tokens/sec 1168 tok/s — — — │
│ Workers 8 — — — │
│ TTFT 58 ms 51 ms 114 ms 140 ms │
│ TPOT 5.3 ms 5.0 ms 7.3 ms 7.4 ms │
│ Latency (per worker) 0.54 s/img 0.46 s 0.92 s 1.36 s │
│ │
│ Tokens (avg) prompt 2,077 • completion 88 │
│ Token ranges prompt 180–8,545 • completion 55–190 │
│ Images 144 • avg 964×867 (0.93 MP) │
│ Resolution min 338×266 • median 1024×768 • max 2048×1755 │
│ VRAM peak 69.7 GB │
│ Reliability 192/192 ok • 14.4s total │
│ │
╰──────────────────────────────────────────────────────────────────────────────╯
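The p50/p95/p99 columns above are percentiles over per-request timings. A minimal sketch of how such percentiles can be computed from raw latencies, using nearest-rank selection (illustrative only; vlmbench's actual interpolation may differ):

```python
# Nearest-rank percentile over per-request latencies (illustrative sketch,
# not vlmbench's actual code).
def percentile(values: list[float], p: float) -> float:
    """Return the nearest-rank percentile, p in [0, 100]."""
    s = sorted(values)
    return s[round(p / 100 * (len(s) - 1))]

# Hypothetical per-image latencies (seconds) from one concurrency level.
latencies_s = [0.41, 0.44, 0.46, 0.48, 0.52, 0.61, 0.92, 1.36]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_s, p):.2f} s")
```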
Best peak throughput per model on NVIDIA RTX PRO 6000 Blackwell (vLLM v0.15.1, 39 runs across concurrency sweeps):
| # | Model | Best Tok/s | Workers | TTFT | TPOT |
|---|---|---|---|---|---|
| 1 | lightonai/LightOnOCR-2-1B | 2,439.8 | 32 | 1,439 ms | 22.1 ms |
| 2 | Qwen/Qwen3-VL-2B-Instruct | 2,409.3 | 64 | 440 ms | 14.3 ms |
| 3 | PaddlePaddle/PaddleOCR-VL | 2,341.9 | 64 | 6,385 ms | 49.0 ms |
| 4 | deepseek-ai/DeepSeek-OCR | 1,195.8 | 32 | 3,571 ms | 15.9 ms |
| 5 | Qwen/Qwen3-VL-8B-Instruct | 953.8 | 64 | 448 ms | 25.7 ms |
Compare your own results:

```shell
uvx vlmbench compare                  # auto-discovers ~/.vlmbench/benchmarks/
uvx vlmbench compare results/*.json   # or pass files explicitly
```

See MODELS.md for all tested models and their required `--serve-args`.
Some models need custom Docker images, extra pip installs, or special serve args. Profiles bundle all of this into a single YAML file — just pass --profile and vlmbench handles the rest.
```shell
uvx vlmbench profiles                                  # list available profiles
uvx vlmbench run --profile deepseek-ocr -i ./images/   # run with a profile
```

When you use `--profile`, it sets `--model`, `--prompt`, `--serve-args`, and (for Docker builds) the base image and setup commands. You can still override any flag explicitly.
| Profile | Model | Base Image | Custom Setup |
|---|---|---|---|
| glm-ocr | zai-org/GLM-OCR | vllm/vllm-openai:nightly | vLLM nightly + transformers >= 5.1.0, MTP speculative decoding |
| deepseek-ocr | deepseek-ai/DeepSeek-OCR | vllm/vllm-openai:v0.15.1 | Custom logits processor, no prefix caching |
| paddleocr-vl | PaddlePaddle/PaddleOCR-VL | vllm/vllm-openai:v0.15.1 | Trust remote code, no prefix caching |
| qwen3-vl-2b | Qwen/Qwen3-VL-2B-Instruct | vllm/vllm-openai:v0.15.1 | — |
| qwen3-vl-8b | Qwen/Qwen3-VL-8B-Instruct | vllm/vllm-openai:v0.15.1 | — |
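As a rough idea of what a profile bundles, a hypothetical sketch of one entry (field names here are assumptions for illustration — the shipped files in vlmbench/profiles/ define the actual schema):

```yaml
# Hypothetical profile sketch; field names are assumed, not the real schema.
name: deepseek-ocr
model: deepseek-ai/DeepSeek-OCR
base_image: vllm/vllm-openai:v0.15.1
serve_args:
  - --no-enable-prefix-caching   # "no prefix caching" per the table above
```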
Profiles live in vlmbench/profiles/*.yaml and ship with the package. For local Docker workflows:
```shell
make build PROFILE=glm-ocr       # generates Dockerfile + docker build
make serve PROFILE=glm-ocr       # start server in tmux
make benchmark PROFILE=glm-ocr   # run benchmark against the server
```

| Flag | Default | Description |
|---|---|---|
| `--model` / `-m` | auto-detect | Model ID. Auto-detected from server if omitted; required only with `--serve`. |
| `--profile` | none | Model profile (e.g. `glm-ocr`). Sets model, prompt, serve-args. See `vlmbench profiles`. |
| `--input` / `-i` | sample URL | File, directory, or URL (images, PDFs, videos) |
| `--dataset` / `-d` | none | HuggingFace dataset (e.g. `hf://vlm-run/FineVision-vlmbench-mini`) |
| `--dataset-image-col` | auto-detect | Image column name in HF dataset |
| `--dataset-text-col` | none | Text column name in HF dataset to use as prompt/document input |
| `--dataset-split` | `train` | Dataset split to load |
| `--base-url` | auto-detect | OpenAI-compatible base URL |
| `--api-key` | `no-key` | API key (also reads `OPENAI_API_KEY` env) |
| `--prompt` | `"Extract all text..."` | Prompt/instruction sent with each input. Pass `""` to use the text column as the full message. |
| `--max-tokens` | `2048` | Max completion tokens |
| `--runs` | `3` | Timed runs per input |
| `--warmup` | `1` | Warmup runs (not recorded, fail-fast on errors) |
| `--concurrency` | `8` | Single value or comma-separated sweep (e.g. `4,8,16,32,64`) |
| `--max-samples` | all | Limit number of input samples (useful for dry-runs) |
| `--output-directory` | `~/.vlmbench/benchmarks/` | Output directory |
| `--tag` | none | Custom label (used in result filename and metadata) |
| `--upload` | off | Upload results to HuggingFace (requires `HF_TOKEN`) |
| `--upload-repo` | `vlm-run/vlmbench-results` | HuggingFace dataset repo for uploads |
| `--backend` | `auto` | `auto`, `ollama`, `vllm`, `vllm-openai:<tag>`, `sglang:<tag>` |
| `--serve/--no-serve` | `--serve` | Auto-start server if none detected |
| `--serve-args` | none | Extra args passed to server |
| `--quant` | `auto` | Quantization metadata: `fp16`, `bf16`, `q4_K_M`, etc. |
| `--revision` | `main` | Model revision metadata |
| `--backend` | Resolves to | Serving |
|---|---|---|
| `auto` | `ollama` on macOS, `vllm-openai:latest` on Linux | Native / Docker |
| `ollama` | Ollama native | `ollama serve` in tmux |
| `vllm` | Native vLLM | `vllm serve` in tmux |
| `vllm-openai:latest` | `vllm/vllm-openai:latest` | `docker run --gpus all` |
| `vllm-openai:nightly` | `vllm/vllm-openai:nightly` | `docker run --gpus all` |
| `sglang:latest` | `lmsysorg/sglang:latest` | `docker run --gpus all` (coming soon) |
All Docker backends run with `--gpus all --ipc=host` and a deterministic container name for easy log access.
| Type | Source | Processing |
|---|---|---|
| Image | `--input` (.png, .jpg, .jpeg, .webp, .tiff, .bmp) | Base64 encode |
| PDF | `--input` (.pdf) | pypdfium2 per-page → base64 |
| Video | `--input` (.mp4, .mov, .avi, .mkv, .webm) | ffmpeg 1fps → frames → base64 |
| HF image dataset | `--dataset hf://...` | Auto-detect image column, base64 encode |
| HF text dataset | `--dataset hf://... --dataset-text-col <col>` | Each row's value sent as a text content block |
Directories are processed recursively, sorted alphabetically.
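The "base64 encode" step above follows the usual OpenAI-compatible pattern of embedding the image as a data URL inside a chat content block. A minimal sketch (illustrative, not vlmbench's actual code; the helper names are made up):

```python
# Sketch: package a local image as an OpenAI-style image content block.
import base64
from pathlib import Path

def data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Base64-encode raw image bytes into a data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

def image_content_block(path: str) -> dict:
    """Build the content-part dict sent in a chat completion request."""
    p = Path(path)
    mime = f"image/{p.suffix.lstrip('.').lower() or 'png'}"
    return {"type": "image_url", "image_url": {"url": data_url(p.read_bytes(), mime)}}
```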
For LLM (non-vision) benchmarks, use an HF dataset with a text column:
```shell
# Each row's "prompt" column is the full message (--prompt "" = no instruction appended)
uvx vlmbench run -m meta-llama/Llama-3.1-8B-Instruct \
  -d hf://my-org/my-prompts --dataset-text-col prompt --prompt ""

# Each row's "text" column is the document; --prompt is the instruction
uvx vlmbench run -m meta-llama/Llama-3.1-8B-Instruct \
  -d hf://my-org/my-docs --dataset-text-col text \
  --prompt "Summarize the above in one sentence."
```

Auto-detection falls back to text columns (named `text`, `prompt`, `input`, `content`, `query`, `question`, `instruction`) when no image column is found.
Results are saved as JSON to ~/.vlmbench/benchmarks/ with model metadata, environment info, benchmark stats (TTFT, TPOT, throughput, latency percentiles), and raw per-run data. Each concurrency level produces a separate file.
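For orientation, a result file is roughly shaped like the following (field names here are illustrative assumptions, not the exact schema; the numbers echo the sample run above):

```json
{
  "model": "Qwen/Qwen3-VL-2B-Instruct",
  "backend": "vllm",
  "concurrency": 8,
  "stats": {
    "ttft_ms": {"mean": 58, "p50": 51, "p95": 114, "p99": 140},
    "tpot_ms": {"mean": 5.3, "p50": 5.0, "p95": 7.3, "p99": 7.4},
    "throughput_img_per_s": 13.33
  }
}
```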
Upload results to HuggingFace with `--upload`:

```shell
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct -d hf://vlm-run/FineVision-vlmbench-mini \
  --concurrency 4,8,16,32,64 --upload
```

Browse uploaded results at vlm-run/vlmbench-results.
When you run `vlmbench run`, here's what happens:
- Detects your platform — macOS routes to Ollama, Linux to vLLM Docker
- Pulls the Docker image — `docker pull vllm/vllm-openai:latest` (cached after first run)
- Starts the server in tmux — `docker run --gpus all` in a named session (`vlmbench-vllm`)
- Launches a GPU monitor — `nvitop` (Linux) or `macmon` (macOS) in a split pane
- Waits for the server — polls `/v1/models` until ready (up to 600s)
- Runs warmup requests — fail-fast validation before timed runs
- Benchmarks with concurrency — streams completions via the OpenAI API, measures TTFT/TPOT/throughput
- Saves results as JSON — one file per concurrency level in `~/.vlmbench/benchmarks/`
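Streaming is what makes TTFT and TPOT observable: the first chunk's arrival gives TTFT, and the spacing of later chunks gives TPOT. A minimal sketch of that arithmetic, assuming you record an arrival timestamp and token count per streamed chunk (illustrative only, not vlmbench's actual measurement code):

```python
# Derive TTFT/TPOT from a streamed completion (illustrative sketch).
# `events` is a list of (arrival_time_seconds, tokens_in_chunk) pairs.
def stream_metrics(start: float, events: list[tuple[float, int]]) -> dict:
    first_arrival = events[0][0]
    last_arrival = events[-1][0]
    total_tokens = sum(n for _, n in events)
    ttft = first_arrival - start
    # TPOT: average inter-token gap after the first token arrives.
    tpot = (last_arrival - first_arrival) / max(1, total_tokens - 1)
    return {"ttft_s": ttft, "tpot_s": tpot, "tokens": total_tokens}
```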
Attach to the live session anytime: `tmux attach -t vlmbench-vllm`
tmux session capture — server logs + GPU monitor side by side
Top pane — vLLM server logs:
(APIServer pid=1) INFO 02-07 15:44:24 non-default args: {
'model': 'lightonai/LightOnOCR-2-1B',
'enable_prefix_caching': False,
'limit_mm_per_prompt': {'image': 1},
'mm_processor_cache_gb': 0.0
}
(APIServer pid=1) INFO 02-07 15:44:34 Resolved architecture: LightOnOCRForConditionalGeneration
(APIServer pid=1) INFO 02-07 15:44:34 Using max model len 16384
(EngineCore pid=272) INFO 02-07 15:44:44 Initializing a V1 LLM engine (v0.15.1) with config:
model='lightonai/LightOnOCR-2-1B', dtype=torch.bfloat16, max_seq_len=16384,
tensor_parallel_size=1, quantization=None
(EngineCore pid=272) INFO 02-07 15:45:41 Loading weights took 0.49 seconds
(EngineCore pid=272) INFO 02-07 15:45:42 Model loading took 1.88 GiB memory and 22.15 seconds
(EngineCore pid=272) INFO 02-07 15:46:11 Available KV cache memory: 77.94 GiB
(EngineCore pid=272) INFO 02-07 15:46:11 Maximum concurrency for 16,384 tokens per request: 44.53x
Capturing CUDA graphs (decode, FULL): 100% |██████████| 51/51
(APIServer pid=1) INFO Started server process [1]
(APIServer pid=1) INFO Application startup complete.
(APIServer pid=1) INFO 172.17.0.1 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Bottom pane — nvitop GPU monitor:
NVITOP 1.6.2 Driver Version: 580.126.09 CUDA Driver Version: 13.0
╒═══════════════════════════════╤══════════════════════╤══════════════════════╕
│ GPU Name Persistence-M│ Bus-Id Disp.A │ Volatile Uncorr. ECC │
│ Fan Temp Perf Pwr:Usage/Cap│ Memory-Usage │ GPU-Util Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╡
│ 0 GeForce RTX 2080 Ti Off │ 00000000:21:00.0 Off │ N/A │
│ 27% 42C P8 17W / 250W │ 107.2MiB / 11264MiB │ 0% Default │
├───────────────────────────────┼──────────────────────┼──────────────────────┤
│ 1 RTX PRO 6000 Off │ 00000000:4B:00.0 Off │ N/A │
│ 30% 33C P1 66W / 600W │ 86.54GiB / 95.59GiB │ 0% Default │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╛
MEM: ███████████████████████████████████████████████████████████▏ 90.5%
Load Average: 4.14 2.73 1.65
Install vlmbench as a Claude Code plugin:
```
# 1. Register the marketplace
/plugin marketplace add vlm-run/vlmbench

# 2. Install the skill
/plugin install vlmbench@vlm-run/vlmbench
```

After restarting Claude Code, the vlmbench skill will be available. Mention it directly in your instructions to benchmark models, compare results, or debug server issues.
- Python >= 3.11, `uv` recommended
- Linux: Docker + NVIDIA GPU support (or native vLLM via `uv pip install vllm`)
- Monitoring: `tmux`, `nvitop` (Linux) or `macmon` (macOS)
- Optional: `ffmpeg` (video frame extraction)
