26 changes: 25 additions & 1 deletion benchmark/vllm/README.md
@@ -61,6 +61,12 @@ The following command pulls the Docker image from Docker Hub.
docker pull vllm/vllm-openai-rocm:v0.17.1
```

For Gemma 4, use the Gemma4-tagged image (also referenced by [`docker/pyt_vllm_gemma4.ubuntu.amd.Dockerfile`](../../docker/pyt_vllm_gemma4.ubuntu.amd.Dockerfile)):

```sh
docker pull vllm/vllm-openai-rocm:gemma4
```

### MAD-integrated benchmarking

Clone the ROCm Model Automation and Dashboarding (MAD) repository to a local directory and install the required packages on the host machine.
@@ -86,7 +92,17 @@ users can also directly run the vLLM benchmark scripts and change the benchmarking
#### Available models

>[!NOTE]
>The MXFP4 models are only supported on the gfx950 architecture, i.e., MI350X/MI355X accelerators.

>[!NOTE]
>Gemma 4 models (`pyt_vllm_gemma-4-*`) are built from `vllm/vllm-openai-rocm:gemma4` (see [`docker/pyt_vllm_gemma4.ubuntu.amd.Dockerfile`](../../docker/pyt_vllm_gemma4.ubuntu.amd.Dockerfile)). Accept Google’s Gemma license on Hugging Face and set `MAD_SECRETS_HFTOKEN` for gated weight downloads.

Serving recipes for Gemma 4 live in [`scripts/vllm/configs/default.yaml`](../../scripts/vllm/configs/default.yaml). Both Gemma 4 entries use **tensor parallel size 1**, **`TRITON_ATTN`**, **`float16` on gfx942** (via `arch_overrides`), **`--max-model-len` 32768**, text-only multimodal limits (`--limit-mm-per-prompt`), and **`VLLM_ROCM_USE_AITER=1`** where supported.

| Model | Notes |
| ----- | ----- |
| **google/gemma-4-31B-it** | Dense instruct. Full serving sweep: **`max_concurrency` 1, 8, 32, 128** (four cold starts). |
| **google/gemma-4-26B-A4B-it** | Sparse MoE (“A4B”). **AITER fused MoE is disabled** via **`VLLM_ROCM_USE_AITER_MOE=0`** so MoE runs on the **Triton** path. **Concurrency sweep is narrowed to 1 and 8** only for typical MAD Docker memory limits. |

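The recipe flags above can be assembled into a standalone serving command. A minimal sketch, with flag spellings taken from `default.yaml` (the command is printed here rather than executed, since it requires a ROCm host):

```shell
# Flags mirror the Gemma 4 default.yaml recipe; the serve invocation itself
# is illustrative and is only echoed, not run.
ARGS="--max-model-len 32768 --gpu-memory-utilization 0.90 \
--attention-backend TRITON_ATTN --limit-mm-per-prompt '{\"image\":0,\"audio\":0}'"
echo "vllm serve google/gemma-4-31B-it $ARGS"
```

For `google/gemma-4-26B-A4B-it`, the same flags apply but with `VLLM_ROCM_USE_AITER_MOE=0` exported first, per the table above.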
| MAD model name | Model repo |
| -------------------------------------- | -------------------------------------- |
@@ -112,6 +128,8 @@ users can also directly run the vLLM benchmark scripts and change the benchmarking
| pyt_vllm_mixtral-8x22b | [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1) |
| pyt_vllm_mixtral-8x22b_fp8 | [amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV](https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV) |
| pyt_vllm_phi-4 | [microsoft/phi-4](https://huggingface.co/microsoft/phi-4) |
| pyt_vllm_gemma-4-26b-a4b-it | [google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it) |
| pyt_vllm_gemma-4-31b-it | [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) |
| pyt_vllm_qwen3-8b | [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) |
| pyt_vllm_qwen3-32b | [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) |
| pyt_vllm_qwen3-30b-a3b | [Qwen/Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) |
@@ -132,6 +150,8 @@ docker pull vllm/vllm-openai-rocm:v0.17.1
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env VLLM_ROCM_USE_AITER=1 --env HUGGINGFACE_HUB_CACHE=/workspace --name test vllm/vllm-openai-rocm:v0.17.1
```

For Gemma 4 standalone runs, substitute `vllm/vllm-openai-rocm:gemma4` for the image tag in the `docker run` line above. For **`google/gemma-4-26B-A4B-it`** only, also set **`VLLM_ROCM_USE_AITER_MOE=0`** (same as the MAD `default.yaml` recipe) so MoE does not use AITER’s fused path.
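Concretely, the substitution described above might look like the following sketch for the MoE model (container name and mount are illustrative; the command is echoed rather than executed):

```shell
# Hypothetical standalone launch for google/gemma-4-26B-A4B-it: same shape as
# the docker run line above, with the gemma4 image tag and the extra MoE env var.
IMAGE=vllm/vllm-openai-rocm:gemma4
CMD="docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
--shm-size 16G --security-opt seccomp=unconfined \
--env VLLM_ROCM_USE_AITER=1 --env VLLM_ROCM_USE_AITER_MOE=0 \
-v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace \
--name gemma4-test $IMAGE"
echo "$CMD"
```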

>[!NOTE]
>We enable [AITER](https://github.com/ROCm/aiter) during `docker run` via `--env VLLM_ROCM_USE_AITER=1` for best performance
>on MI3xx (i.e., gfx942 and gfx950) platforms. If you're using this docker image on other AMD GPUs, e.g., MI2xx or Radeon,
@@ -345,6 +365,10 @@ owners and are only mentioned for informative purposes.
----------
This release note summarizes notable changes since the previous docker release.

MAD `pyt_vllm_gemma-4-*` configs (see [`default.yaml`](../../scripts/vllm/configs/default.yaml)):
- **gemma-4-26B-A4B-it:** set `VLLM_ROCM_USE_AITER_MOE=0` (Triton MoE); narrowed default `max_concurrency` to **1 8** to avoid OOM on repeated server restarts.
- **gemma-4-31B-it:** unchanged full sweep **1 8 32 128**; no `VLLM_ROCM_USE_AITER_MOE` override.

v0.17.1 release:
- Includes documentation and patches for upstream releases. Please track https://github.com/vllm-project/vllm/releases
for all future release notes.
42 changes: 42 additions & 0 deletions docker/pyt_vllm_gemma4.ubuntu.amd.Dockerfile
@@ -0,0 +1,42 @@
# CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}
###############################################################################
#
# MIT License
#
# Copyright (c) Advanced Micro Devices, Inc.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#
#################################################################################
# Gemma 4 requires a vLLM build with Gemma4 support; see vLLM recipes (Google/Gemma4.md).
ARG BASE_DOCKER=vllm/vllm-openai-rocm:gemma4
FROM $BASE_DOCKER

USER root
ENV WORKSPACE_DIR=/workspace
RUN mkdir -p $WORKSPACE_DIR
WORKDIR $WORKSPACE_DIR

RUN pip3 install --no-cache-dir "transformers==5.5.0"

# record configuration for posterity
RUN pip3 list

# Specify entrypoint to override upstream
ENTRYPOINT [""]
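A local build of this Dockerfile could be invoked as sketched below (the output tag `local/vllm-gemma4` is illustrative; the command is echoed, not run, since it needs a Docker daemon):

```shell
# Hypothetical build command for the Dockerfile above, run from the repo root.
BUILD_CMD="docker build -f docker/pyt_vllm_gemma4.ubuntu.amd.Dockerfile \
-t local/vllm-gemma4:latest ."
echo "$BUILD_CMD"
```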
38 changes: 38 additions & 0 deletions models.json
@@ -487,6 +487,44 @@
"args":
"--model_repo Qwen/Qwen3-8B --config configs/extended.yaml"
},
{
"name": "pyt_vllm_gemma-4-26b-a4b-it",
"data": "huggingface",
"dockerfile": "docker/pyt_vllm_gemma4",
"scripts": "scripts/vllm/run.sh",
"n_gpus": "-1",
"owner": "mad.support@amd.com",
"training_precision": "",
"multiple_results": "perf_gemma-4-26B-A4B-it.csv",
"tags": [
"pyt",
"vllm",
"vllm_extended",
"inference"
],
"timeout": -1,
"args":
"--model_repo google/gemma-4-26B-A4B-it --config configs/default.yaml"
},
{
"name": "pyt_vllm_gemma-4-31b-it",
"data": "huggingface",
"dockerfile": "docker/pyt_vllm_gemma4",
"scripts": "scripts/vllm/run.sh",
"n_gpus": "-1",
"owner": "mad.support@amd.com",
"training_precision": "",
"multiple_results": "perf_gemma-4-31B-it.csv",
"tags": [
"pyt",
"vllm",
"vllm_extended",
"inference"
],
"timeout": -1,
"args":
"--model_repo google/gemma-4-31B-it --config configs/default.yaml"
},
{
"name": "pyt_vllm_qwen3-32b",
"data": "huggingface",
41 changes: 41 additions & 0 deletions scripts/vllm/configs/default.yaml
@@ -92,6 +92,47 @@
VLLM_ROCM_USE_AITER: 1
extra_args:
--attention-backend: ROCM_ATTN
arch_overrides:
gfx942:
dtype: float16

## Gemma 4: vLLM recipe recommends 1x MI300-class GPU (BF16); tp 1 for text-only bench
## Use TRITON_ATTN (Gemma4 default); 26B-A4B MoE: VLLM_ROCM_USE_AITER_MOE=0; narrow concurrency for 26B to avoid OOM
- benchmark: serving
model: google/gemma-4-26B-A4B-it
tp: 1
inp: 1024
out: 1024
dtype: auto
max_concurrency: 1 8
env:
VLLM_ROCM_USE_AITER: 1
VLLM_ROCM_USE_AITER_MOE: 0
extra_args:
--attention-backend: TRITON_ATTN
--max-model-len: 32768
--gpu-memory-utilization: 0.90
--limit-mm-per-prompt: '{"image":0,"audio":0}'
--async-scheduling: True
arch_overrides:
gfx942:
dtype: float16

- benchmark: serving
model: google/gemma-4-31B-it
tp: 1
inp: 1024
out: 1024
dtype: auto
max_concurrency: 1 8 32 128
env:
VLLM_ROCM_USE_AITER: 1
extra_args:
--attention-backend: TRITON_ATTN
--max-model-len: 32768
--gpu-memory-utilization: 0.90
--limit-mm-per-prompt: '{"image":0,"audio":0}'
--async-scheduling: True
arch_overrides:
gfx942:
dtype: float16
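The space-separated `max_concurrency` values above imply one serving-benchmark pass per concurrency level. A minimal sketch of that expansion, assuming (as the release notes suggest) each level is launched as its own run; the split logic here is illustrative, not MAD's actual implementation:

```python
# Hypothetical expansion of a default.yaml entry: one run per concurrency level.
config = {"model": "google/gemma-4-31B-it", "max_concurrency": "1 8 32 128"}

runs = [
    {"model": config["model"], "max_concurrency": int(c)}
    for c in config["max_concurrency"].split()
]
print([r["max_concurrency"] for r in runs])  # -> [1, 8, 32, 128]
```

Narrowing the 26B-A4B entry to `"1 8"` therefore halves the number of cold starts, which is what avoids the OOM on repeated server restarts.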
12 changes: 11 additions & 1 deletion scripts/vllm/run_vllm.py
@@ -34,6 +34,7 @@
import signal
import argparse
import itertools
import shlex
import subprocess
from typing import List, Dict

@@ -490,7 +491,16 @@ def main():
if isinstance(v, bool):
extra_args_str += f" {k}"
else:
extra_args_str += f" {k} {v}"
s = str(v)
st = s.strip()
if (
k == "--limit-mm-per-prompt"
or (st[:1] in "{[")
or any(ch.isspace() for ch in s)
):
extra_args_str += f" {k} {shlex.quote(s)}"
else:
extra_args_str += f" {k} {v}"
config["env"] = env_vars_str
config["extra_args"] = extra_args_str
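The quoting branch added above can be exercised in isolation. This sketch reproduces the same logic to show why a JSON-valued flag such as `--limit-mm-per-prompt` needs `shlex.quote`, while plain scalar flags pass through unchanged:

```python
import shlex

def append_arg(extra_args_str, k, v):
    # Mirror of the new branch: JSON-like or whitespace-containing values are
    # shell-quoted so the downstream shell sees them as a single argument.
    s = str(v)
    st = s.strip()
    if k == "--limit-mm-per-prompt" or st[:1] in "{[" or any(ch.isspace() for ch in s):
        return extra_args_str + f" {k} {shlex.quote(s)}"
    return extra_args_str + f" {k} {v}"

out = append_arg("", "--limit-mm-per-prompt", '{"image":0,"audio":0}')
print(out.strip())  # -> --limit-mm-per-prompt '{"image":0,"audio":0}'
```

Without the quote, the braces and the `:` inside the JSON would be split or mangled by the shell when the server command line is assembled.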
