Scenario-based benchmarking and profiling helpers for vLLM serving workloads.
For each scenario in a YAML config, `vllm_bench.py`:

- Launches `vllm serve` with scenario-specific parameters.
- Runs `vllm bench serve` across one or more concurrency points.
- Saves benchmark JSON output and a cross-scenario summary CSV.
- Optionally collects:
  - Nsight Systems (`nsys`) traces
  - PyTorch Profiler traces
This is useful for repeatable performance studies, regression tracking, and profiling runs.
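The per-scenario flow is roughly the following. This is a minimal sketch, not the actual `vllm_bench.py` implementation: the helper name, health-check loop, and the `vllm bench serve` flags (`--max-concurrency`, `--save-result`, `--result-filename`) are assumptions that may differ by vLLM version.

```python
# Minimal sketch of the per-scenario loop (illustrative only): launch the server,
# sweep the configured concurrencies with `vllm bench serve`, keep the JSON results.
# `scenario` is a dict from the YAML config; exact bench flags may vary by vLLM version.
import subprocess
import time
from pathlib import Path

import requests

def run_scenario(model: str, scenario: dict, results_dir: Path) -> None:
    results_dir.mkdir(parents=True, exist_ok=True)
    port = scenario["port"]
    server = subprocess.Popen(
        ["vllm", "serve", model, "--port", str(port), *scenario.get("params", "").split()]
    )
    try:
        # Wait until the server answers /health before benchmarking.
        while True:
            try:
                if requests.get(f"http://localhost:{port}/health", timeout=2).ok:
                    break
            except requests.RequestException:
                pass
            time.sleep(5)
        for conc in scenario["bench"]["concurrencies"]:
            result = results_dir / f"{scenario['name']}_conc{conc}.json"
            subprocess.run(
                ["vllm", "bench", "serve", "--model", model, "--port", str(port),
                 "--max-concurrency", str(conc),
                 "--save-result", "--result-filename", str(result)],
                check=True,
            )
    finally:
        server.terminate()
        server.wait()
```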
```bash
pip install -r requirements.txt
```

Install these if you use the legacy shell scripts or profiling workflows:

```bash
sudo apt-get install -y jq curl
pip install yq
```

Profiling tools (optional but recommended for GPU analysis):

- Nsight Systems (`nsys`)
- Nsight Compute (`ncu`)
- PyTorch Profiler (enabled through scenario config)
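A quick way to confirm the optional profilers are actually on `PATH` before enabling profiling in a scenario (a small standalone check, not part of the tool):

```python
# Check whether the optional GPU profilers are installed and on PATH.
import shutil

for tool in ("nsys", "ncu"):
    path = shutil.which(tool)
    print(f"{tool}: {path or 'not found'}")
```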
```bash
python vllm_bench.py <config.yaml> [--scenario name1,name2] [--delay SEC] [--duration SEC] \
    [--mlflow-experiment NAME] [--mlflow-run-name NAME] [--mlflow-tracking-uri URI] [--mlflow-tag KEY=VALUE]
```

```bash
# Run all scenarios
python vllm_bench.py configs/models/gpt-oss-20b.yaml

# Run a single scenario
python vllm_bench.py configs/models/gpt-oss-20b.yaml --scenario baseline

# Run multiple scenarios
python vllm_bench.py configs/models/gpt-oss-20b.yaml --scenarios baseline,async_scheduling

# Start nsys after the benchmark has already started
python vllm_bench.py configs/models/gpt-oss-20b.yaml --scenario baseline --delay 15

# Collect nsys for a fixed duration (seconds)
python vllm_bench.py configs/models/gpt-oss-20b.yaml --scenario baseline --duration 30

# Upload artifacts to a specific MLflow experiment
python vllm_bench.py configs/models/gpt-oss-20b.yaml --mlflow-experiment vllm-bench --mlflow-run-name gptoss20b-baseline

# Add MLflow tags (repeat --mlflow-tag for multiple)
python vllm_bench.py configs/models/gpt-oss-20b.yaml --mlflow-tag team=perf --mlflow-tag model_family=gptoss

# Disable MLflow uploads
python vllm_bench.py configs/models/gpt-oss-20b.yaml --no-mlflow
```

- `config`: YAML file with model/defaults/scenarios.
- `--scenario` / `--scenarios`: comma-separated scenario names to run.
- `--delay`: delay before `nsys start` (useful with warmup-heavy startup).
- `--duration`: stop nsys after a fixed time instead of at end-of-benchmark.
- `--mlflow-experiment`: MLflow experiment name.
- `--mlflow-run-name`: MLflow run name override.
- `--mlflow-tracking-uri`: custom MLflow tracking URI.
- `--mlflow-tag`: MLflow tag as `KEY=VALUE` (repeatable).
- `--no-mlflow`: skip MLflow upload.
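For reference, the CLI surface above maps onto a straightforward `argparse` setup. The sketch below is illustrative only (the real parser in `vllm_bench.py` may differ); it shows how the repeatable `--mlflow-tag KEY=VALUE` and the comma-separated scenario list are typically handled:

```python
# Illustrative parser for the CLI described above; not the actual vllm_bench.py parser.
import argparse

parser = argparse.ArgumentParser(description="Scenario-based vLLM benchmark driver")
parser.add_argument("config", help="YAML file with model/defaults/scenarios")
parser.add_argument("--scenario", "--scenarios", dest="scenarios",
                    help="comma-separated scenario names to run")
parser.add_argument("--delay", type=float, help="delay (s) before nsys start")
parser.add_argument("--duration", type=float, help="stop nsys after this many seconds")
parser.add_argument("--mlflow-experiment")
parser.add_argument("--mlflow-run-name")
parser.add_argument("--mlflow-tracking-uri")
parser.add_argument("--mlflow-tag", action="append", default=[],
                    help="MLflow tag as KEY=VALUE; repeat for multiple tags")
parser.add_argument("--no-mlflow", action="store_true", help="skip MLflow upload")

# Demo invocation so the snippet is self-contained.
args = parser.parse_args(
    ["configs/models/gpt-oss-20b.yaml", "--scenarios", "baseline,async_scheduling",
     "--mlflow-tag", "team=perf", "--mlflow-tag", "model_family=gptoss"]
)
scenarios = args.scenarios.split(",") if args.scenarios else None  # None = run all
mlflow_tags = dict(t.split("=", 1) for t in args.mlflow_tag)       # {"team": "perf", ...}
print(scenarios, mlflow_tags)
```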
At minimum, your config should contain:
```yaml
model:
  name: meta-llama/Llama-3.1-8B-Instruct
  base_params: "--gpu-memory-utilization 0.9"

defaults:
  study_dir: Study_llama
  env:
    VLLM_USE_V1: "1"
  bench:
    concurrencies: [1, 8, 32]
    input_len: 1024
    output_len: 128
    cc_mult: 10

scenarios:
  - name: baseline
    port: 8000
    params: "--max-model-len 8192"
    bench:
      concurrencies: [1, 16, 64]
    profile: true
    profiling:
      nsys_launch_args: "--trace=cuda,nvtx,osrt --start-later=true"
      nsys_start_args: "--force-overwrite=true --gpu-metrics-devices=cuda-visible"
      torch_profiler:
        enabled: true
```
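Per-scenario keys override the `defaults` section, so `baseline` above ends up with `concurrencies: [1, 16, 64]` while inheriting `input_len`, `output_len`, and `cc_mult`. A minimal sketch of that merge, assuming a simple shallow per-section merge (the actual logic lives in `vllm_bench.py` and may differ):

```python
# Sketch of resolving a scenario's effective bench settings from the config above.
# Assumes a shallow merge where scenario keys win; vllm_bench.py may merge differently.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

def effective_bench(cfg: dict, scenario: dict) -> dict:
    merged = dict(cfg.get("defaults", {}).get("bench", {}))
    merged.update(scenario.get("bench", {}))  # scenario overrides defaults
    return merged

for scenario in cfg["scenarios"]:
    print(scenario["name"], effective_bench(cfg, scenario))
# baseline -> {'concurrencies': [1, 16, 64], 'input_len': 1024, 'output_len': 128, 'cc_mult': 10}
```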
Each run creates a timestamped study directory:

```text
<study_dir>_<timestamp>/
  config.yaml
  summary.csv
  scenario_<name>/
    logs/
      vllm_server.log
    results/
      <result_prefix>.json
    profiles/
      nsys_server.qdrep|.nsys-rep   # when using direct `nsys profile` mode
      nsys_conc<k>.qdrep|.nsys-rep  # when using start/stop session mode
      torch/
        trace_conc<k>*
```
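To compare runs afterwards, the newest study directory can be picked up and its summary loaded directly. This is a sketch: the glob pattern assumes the `Study_llama` name from the example config, and the exact `summary.csv` columns depend on the benchmark output.

```python
# Load the cross-scenario summary from the most recent study directory.
from pathlib import Path

import pandas as pd

# study_dir from the config, e.g. Study_llama -> Study_llama_<timestamp>/
study_dirs = sorted(Path(".").glob("Study_llama_*"))
latest = study_dirs[-1]

summary = pd.read_csv(latest / "summary.csv")
print(f"Loaded {len(summary)} rows from {latest / 'summary.csv'}")
print(summary.head())
```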
Notes:

- `summary.csv` aggregates every scenario/concurrency run.
- Nsight output now writes under each scenario's `profiles/` directory.
- The exact Nsight extension varies by nsys version (`.qdrep` and/or `.nsys-rep`).
- Run output is captured in `logs/benchmark_output.log`.
- MLflow upload includes the full study directory, `nvidia-smi`, `lscpu`, command metadata, config, and run log.
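The MLflow upload can also be reproduced manually with the standard MLflow client. A rough sketch with illustrative URIs, names, tags, and directory, not the exact code the tool runs:

```python
# Manually upload a finished study directory to MLflow, mirroring the tool's
# automatic upload. The tracking URI, names, tags, and path here are examples.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")               # --mlflow-tracking-uri
mlflow.set_experiment("vllm-bench")                             # --mlflow-experiment
with mlflow.start_run(run_name="gptoss20b-baseline"):           # --mlflow-run-name
    mlflow.set_tags({"team": "perf", "model_family": "gptoss"}) # --mlflow-tag KEY=VALUE
    mlflow.log_artifacts("Study_llama_20250101_120000")         # the full study directory
```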
`vllm_bench.sh` remains available for older workflows/configs:

```bash
bash vllm_bench.sh config.yaml
bash vllm_bench.sh configs/models/gpt-oss-20b.yaml --scenario baseline
```

For code flow and debugging notes, see CODE_README.md.