Use gfx11-ci Docker container for wheel build and kernel tests #873

Merged
mgehre-amd merged 4 commits into gfx11 from matthias.gfx11-ci-use-docker
Apr 14, 2026

@mgehre-amd mgehre-amd commented Apr 14, 2026

Summary

  • Switch the build-wheel and test-kernels jobs to the pre-built ghcr.io/rocm/vllm/gfx11-ci:latest container image (introduced in #872, "Add Docker CI image for gfx11 wheel builds")
  • Eliminates ~15 minutes of setup overhead per CI run by removing ROCm SDK install, env var configuration, free-disk-space, setup-python, pip cache, and sccache-action steps
  • Test job uses --device /dev/kfd --device /dev/dri for GPU passthrough on the self-hosted Strix Halo runner
  • Step count reduced from ~11 to ~6 (build) and ~7 to ~4 (test)
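
A minimal sketch of how a container job with GPU passthrough might look in the workflow file. The job name, image, device flags, and test file name come from this description; the runner labels, the checkout step, and the test invocation are assumptions for illustration and are not copied from the PR's diff.

```yaml
# Sketch only: not the actual workflow from this PR.
jobs:
  test-kernels:
    # Assumed labels for the self-hosted Strix Halo runner.
    runs-on: [self-hosted, gfx11]
    container:
      image: ghcr.io/rocm/vllm/gfx11-ci:latest
      # Expose the ROCm compute (/dev/kfd) and render (/dev/dri) devices
      # to the container so the tests can see the GPU.
      options: --device /dev/kfd --device /dev/dri
    steps:
      - uses: actions/checkout@v4
      # Hypothetical invocation; the real test path and flags may differ.
      - name: Run kernel tests
        run: python -m pytest test_hip_w4a16.py
```

Because the image already ships ROCm, PyTorch, sccache, and uv, a job of this shape needs no separate setup-python, pip cache, or SDK install steps.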

Verified locally

  • Built vLLM wheel inside gfx11-ci:local container
  • Ran test_hip_w4a16.py (106/106 passed) with GPU passthrough

Test plan

  • CI build-wheel job succeeds with the container
  • CI test-kernels job detects the GPU and passes kernel tests
  • upload-wheel job (unchanged) still works on push to gfx11

Switch build-wheel and test-kernels jobs to use the pre-built
ghcr.io/rocm/vllm/gfx11-ci container image which has ROCm SDK,
PyTorch, sccache, and uv pre-baked.

This eliminates ~15 minutes of setup overhead per CI run by removing:
- free-disk-space step (ROCm/PyTorch already in image layers)
- setup-python, pip cache, sccache-action steps
- ROCm SDK pip install and env var configuration
- uv/system-deps installation in the test job

The test-kernels job uses --device /dev/kfd --device /dev/dri for
GPU passthrough on the self-hosted Strix Halo runner.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>

The GHA cache service URL (ACTIONS_CACHE_URL) is not forwarded
into container jobs, causing sccache to fail on startup. Remove
the flag so sccache operates as a local-only cache within the build.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>

Container jobs don't automatically inherit ACTIONS_CACHE_URL,
ACTIONS_RESULTS_URL, and ACTIONS_RUNTIME_TOKEN from the runner.
Use actions/github-script to export them so sccache can use the
GitHub Actions cache backend for cross-run compilation caching.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
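
The export described in the commit above could be sketched roughly as follows. This is an illustration of the general pattern, not the step from this PR: the step names and the action version are assumptions, and `SCCACHE_GHA_ENABLED` is sccache's documented switch for its GitHub Actions backend, which the PR may or may not set in this way.

```yaml
# Sketch only: step names and the action version are assumptions.
steps:
  - name: Export GitHub Actions cache variables
    uses: actions/github-script@v7
    with:
      script: |
        // Re-export the runner-provided values so later steps in the
        // container job can read them as ordinary environment variables.
        const names = ['ACTIONS_CACHE_URL', 'ACTIONS_RESULTS_URL', 'ACTIONS_RUNTIME_TOKEN'];
        for (const name of names) {
          core.exportVariable(name, process.env[name] || '');
        }
  - name: Show sccache stats
    env:
      # Ask sccache to use the GitHub Actions cache backend.
      SCCACHE_GHA_ENABLED: "true"
    run: sccache --show-stats
```

The export step has to run before any step that starts sccache, since variables set with `core.exportVariable` are visible only to subsequent steps.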

@mgehre-amd mgehre-amd force-pushed the matthias.gfx11-ci-use-docker branch from 460f873 to a65bdff Compare April 14, 2026 08:58

sccache v0.8.1 uses the legacy GitHub Actions cache v1 API, which was
sunset in April 2025. v0.14.0 supports the v2 API (ACTIONS_RESULTS_URL).
Upgrade in both the Dockerfile and as a runtime override in the
workflow until the next image rebuild picks up the Dockerfile change.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
@mgehre-amd mgehre-amd force-pushed the matthias.gfx11-ci-use-docker branch from a65bdff to d499123 Compare April 14, 2026 09:05
@mgehre-amd mgehre-amd merged commit bd1bc23 into gfx11 Apr 14, 2026
7 of 8 checks passed
@mgehre-amd mgehre-amd deleted the matthias.gfx11-ci-use-docker branch April 14, 2026 21:29