Summary
On ROCm/HIP builds, the host unconditionally launches indexer_scores_wmma_kernel, but that kernel's body is gated by #if __CUDA_ARCH__ >= 700. hipcc never defines __CUDA_ARCH__, so on HIP the kernel compiles to an empty body and the output scores buffer is left uninitialized on the head_dim == 128, n_head == 64, !quality_mode indexer path. This silently produces incorrect results on any ROCm GPU that takes this path (observed targeting gfx1151 / Radeon 8060S, Strix Halo).
Where
- Kernel definition:
ds4_cuda.cu (~line 4483), guarded #if __CUDA_ARCH__ >= 700.
- Host launch:
ds4_cuda.cu (~line 5002) — launches the WMMA kernel instead of falling through to the scalar indexer_scores_kernel, which fully initializes scores.
Impact
Wrong indexer output (silently, no error) on ROCm for the affected attention shape. Because it's an uninitialized-buffer read, symptoms can be nondeterministic.
Relationship to #311
PR #311 (9565c07) masks this by re-enabling the WMMA body on HIP — it changes the guard to #if __CUDA_ARCH__ >= 700 || defined(__HIP_DEVICE_COMPILE__) and supplies the matrix API via #include <rocwmma/rocwmma.hpp> + namespace wmma = rocwmma; in ds4_rocm.h. That works only where full rocWMMA headers are installed.
Environments that ship just the rocWMMA version header — notably the AMD HIP SDK for Windows (no rocwmma/ directory under include/) — cannot compile that path (wmma::fragment undefined). There, the correct behavior is to use the scalar fallback.
Suggested fix
Either:
Option (a) is what we're using to build/run natively on the Windows HIP SDK; happy to send a small PR for it if useful.
Environment
- GPU: AMD Radeon 8060S (gfx1151, Strix Halo).
- ROCm 7.1 (AMD HIP SDK for Windows) and ROCm 7.13 (Linux pip SDK).
- Found while adding a native Windows (hipcc) ROCm build of the bench.
Summary
On ROCm/HIP builds, the host unconditionally launches
indexer_scores_wmma_kernel, but that kernel's body is gated by#if __CUDA_ARCH__ >= 700.hipccnever defines__CUDA_ARCH__, so on HIP the kernel compiles to an empty body and the outputscoresbuffer is left uninitialized on thehead_dim == 128, n_head == 64, !quality_modeindexer path. This silently produces incorrect results on any ROCm GPU that takes this path (observed targeting gfx1151 / Radeon 8060S, Strix Halo).Where
ds4_cuda.cu(~line 4483), guarded#if __CUDA_ARCH__ >= 700.ds4_cuda.cu(~line 5002) — launches the WMMA kernel instead of falling through to the scalarindexer_scores_kernel, which fully initializesscores.Impact
Wrong indexer output (silently, no error) on ROCm for the affected attention shape. Because it's an uninitialized-buffer read, symptoms can be nondeterministic.
Relationship to #311
PR #311 (
9565c07) masks this by re-enabling the WMMA body on HIP — it changes the guard to#if __CUDA_ARCH__ >= 700 || defined(__HIP_DEVICE_COMPILE__)and supplies the matrix API via#include <rocwmma/rocwmma.hpp>+namespace wmma = rocwmma;inds4_rocm.h. That works only where full rocWMMA headers are installed.Environments that ship just the rocWMMA version header — notably the AMD HIP SDK for Windows (no
rocwmma/directory underinclude/) — cannot compile that path (wmma::fragmentundefined). There, the correct behavior is to use the scalar fallback.Suggested fix
Either:
indexer_scores_kernel(e.g.#ifndef __HIP_PLATFORM_AMD__around the WMMA launch, or aDS4_CUDA_NO_INDEXER_WMMAtoggle). Keeps ROCm buildable on the Windows HIP SDK. — or —Option (a) is what we're using to build/run natively on the Windows HIP SDK; happy to send a small PR for it if useful.
Environment