win: native Windows ROCm/HIP GPU build for gfx1151 (Strix Halo) + correctness (fixes #348) by jamesburton · Pull Request #352 · antirez/ds4

jamesburton · 2026-06-07T20:48:40Z

Summary

Adds a native Windows ROCm/HIP GPU build of ds4 for gfx1151 (AMD Strix Halo / Ryzen AI MAX+ 395, Radeon 8060S iGPU) — no WSL — and fixes a ROCm correctness bug along the way. Built and run on real silicon with the AMD HIP SDK for Windows (ROCm 7.1).

This continues the native-Windows series started in #344 (CPU build) and fixes #348 (uninitialized indexer scores on HIP).

⚠️ Draft / stacked PR. This branch builds on three still-open PRs, so their commits appear here too — see Dependencies. It's verified on the Windows HIP SDK path so far (not yet Linux ROCm). Happy to rebase / split however you prefer.

Commits (over `rocm`)

commit	what	note
`e217b65`	`DS4_CUDA_MANAGED` for full UMA pool	= #313 (dependency)
`9953dc0`	chunked model copy without OOM	= #320 (dependency)
`06b282c`	native Windows MinGW CPU build (POSIX shim)	= #344 (dependency)
`3cb9e28`	native Windows ROCm/HIP GPU build for gfx1151	⭐ this PR
`3b561b7`	64-bit file stat so >2 GB models load (fix `EOVERFLOW`)	⭐ this PR
`4faede4`	indexer scalar fallback when rocWMMA unavailable	⭐ fixes #348
`8795f3e`	warp pair-matmul kernel (+30% gen)	overlaps #311 — easy to drop

Details

Native Windows ROCm build (3cb9e28). Compiles ds4_cuda.cu with hipcc/clang from the AMD ROCm 7.1 Windows SDK (MSVC ABI) via a dependency-free POSIX shim — no WSL, no /opt/rocm. The Windows HIP SDK ships only the rocWMMA version header (no rocwmma/ tree), which motivates the next two fixes.

64-bit file stat (3b561b7). The CRT default struct stat/fstat are 32-bit → EOVERFLOW ("value too large") on the ~80 GB q2 model. Remaps stat/fstat → _stat64 family. Both the HIP (hipcc/MSVC) and MinGW CPU builds verified.

Indexer scalar fallback — fixes #348 (4faede4). On HIP, indexer_scores_wmma_kernel was launched unconditionally but its body is gated by #if __CUDA_ARCH__ >= 700, which hipcc never defines → the kernel compiles to an empty body and leaves scores uninitialized on the head_dim==128, n_head==64, !quality path (silent, nondeterministic wrong output). This guards the launch so HIP without full rocWMMA falls through to the scalar indexer_scores_kernel. (#311 instead enables the WMMA body on HIP via <rocwmma/rocwmma.hpp>, which can't compile on the Windows HIP SDK — see #348 for the full analysis.) Correctness confirmed via the deterministic MoE check (DS4_CUDA_MOE_NO_ATOMIC_DOWN=1 → top1_mismatch=0).

Warp pair-matmul (8795f3e). Ports the coalesced warp-stride f16 pair-matmul (one warp per output row, reduced via warp_sum_f32); gen-only (n_tok==1), CUDA path byte-identical, default ON (revert DS4_ROCM_NO_F16_PAIR_WARP_MATMUL=1). This is #311's kernel — included here because we measured it, but trivially dropped if you'd rather keep it with #311.

Benchmarks (Radeon 8060S, gfx1151, q2-imatrix 80.76 GiB, 96 GB VRAM split)

Generation: 10.5–11.5 tok/s at ≤8k context (cool); best 11.47 tok/s at ctx 4096 with -t 4. In line with the gfx1151 figures in rocm: fix gfx1151 correctness (precise expert-selection math), enable indexer, enable agent #311 (~11.79) / cuda: add DS4_CUDA_MANAGED env var for full UMA pool access on Strix Halo #313 (~8.71 @ 1M ctx).
Warp kernel: +30% cool (8.0→10.5 at 2k/4k), +10–23% even when thermally soaked.
No VRAM OOM found: KV allocated up to 1,048,576 tokens (18.0 GiB context buffers + 80.76 GiB model) on the 96 GB split.
Stable: 12 back-to-back runs flat at 10.67–10.69 tok/s; a 1024-token generation holds 10.60.
Gen throughput tracks filled context depth (≈10.5 ≤8k → ≈5 at 16–32k, attention cost), not allocated size; high context is prefill-bound.

Full 38-run methodology + data: https://github.com/jamesburton/ds4-rocm/tree/main/results

Test environment

GPU: AMD Radeon 8060S (gfx1151, Strix Halo), 128 GB LPDDR5X-8533, BIOS 96 GB iGPU VRAM split.
Toolchain: AMD HIP SDK for Windows, ROCm 7.1; MSVC ABI; MinGW GCC 15.2 for the CPU build.
Runtime env: DS4_CUDA_COPY_MODEL_CHUNKED=1, DS4_LOCK_FILE=<win path>.

Dependencies

Built on #344 (CPU build), #313 (DS4_CUDA_MANAGED), #320 (chunked copy) — all currently open. Their commits are included in this branch; this PR is cleanest reviewed/merged after them. Fixes #348.

Add DS4_CUDA_MANAGED=1 environment variable that switches ds4_gpu_tensor_alloc from cudaMalloc (VRAM carve-out only) to cudaMallocManaged (full unified memory pool). This is critical for UMA platforms like AMD Strix Halo where the BIOS VRAM carve-out (e.g. 96 GB) is smaller than physical memory (128 GB). Without this, context buffers are limited to the BIOS carve-out, capping usable context at ~870K when the model weights occupy ~81 GB. With managed memory, the full 128 GB pool is available, enabling 1M-token context. The change is opt-in and zero-overhead when the env var is unset. Also fix two rsqrtf() calls to 1.0f/sqrtf() for ROCm compatibility. Tested on Strix Halo (Ryzen AI MAX+ 395, gfx1151): DS4_CUDA_MANAGED=1 ./ds4-server --ctx 1000000 → context buffers 17222.50 MiB, server starts successfully → 1M context chat completion: 8.71 t/s, correct output Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

When DS4_CUDA_COPY_MODEL_CHUNKED is set, skip cudaHostRegister. Registering a large memory map prevents posix_madvise(DONTNEED) from freeing pages during the chunked copy, leading to catastrophic system RAM exhaustion on APUs with unified memory.

Adds native Windows CPU build support for ds4-bench using MinGW-w64 GCC (no WSL, Cygwin or MSVC). A small header-only POSIX shim, ds4_win.h, supplies the surface MinGW/UCRT lacks: mmap/munmap/madvise, sysconf, flock/fcntl/pread/dprintf, and a temp-file-backed fmemopen. MinGW already provides pthread, clock_gettime and ftruncate. The shim is wired in-tree behind #ifdef _WIN32: ds4.c includes ds4_win.h in place of <sys/mman.h> on Windows. The header's whole body is guarded by _WIN32, so POSIX (Linux/macOS/CUDA/ROCm) builds are unchanged. Makefile detects MinGW/MSYS (uname -s = MINGW*/MSYS*), defaults CC to gcc there, and adds a `windows-cpu` target that builds ds4-bench.exe with -DDS4_NO_GPU. Darwin/Linux/cuda/rocm paths are untouched. win/README.md documents the build (make + direct-gcc fallback), the DS4_LOCK_FILE runtime note, and the deferred CLI (termios/sigaction) and server (Winsock) ports. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Build DS4's GPU backend natively on Windows with the AMD HIP SDK — no WSL, no MSVC on PATH, no full Visual Studio needed at the command line (hipcc auto-discovers the VS build tools' headers). Produces a gfx1151 ROCm ds4-bench.exe that links the HIP runtime + hipBLAS and starts up, selecting the GPU backend before model load. Toolchain findings (Windows HIP SDK 7.1): - hipcc.exe compiles, links, and runs HIP + hipBLAS for gfx1151 using its bundled clang (target x86_64-pc-windows-msvc); no cl.exe on PATH. - The host ABI is therefore MSVC, so the C host files are compiled with clang --target=...-windows-msvc to match the hipcc-built ds4_cuda.o (mixing MinGW and MSVC C runtimes across the FILE*/heap boundary is unsafe). The MinGW CPU build is left exactly as-is. Changes (all GPU-build portability behind #ifdef _WIN32, sub-guarded by __MINGW32__ / DS4_WIN_PTHREAD, so POSIX/macOS/CUDA/Linux-ROCm builds are byte-for-byte unchanged): - Vendor the rocWMMA version header (MIT, header-only) the Windows SDK omits; ds4_rocm.h needs only this (DS4's WMMA path is CUDA-only). - ds4_cuda.cu: use the ds4_win.h POSIX shim + <io.h> instead of <unistd.h>; skip the Linux-only st_blksize O_DIRECT hint. - ds4_win.h: add STDERR_FILENO/ssize_t/SSIZE_MAX/off_t, clock_gettime + CLOCK_MONOTONIC, and ftruncate for the MSVC-ABI build (guarded !__MINGW32__ so the MinGW CPU build is untouched). - win/ds4_pthread_win.h: header-only Win32 pthread shim (threads, mutex, condvar, once) used when DS4_WIN_PTHREAD is set (MSVC has no pthread.h). - win/build-rocm.sh: build script that works around hipcc.exe's space-splitting .bat wrapper, and synthesizes the missing MSVC-style hipblas.lib from libhipblas.dll via llvm-dlltool (git-ignored, cached). - Makefile: add a `windows-rocm` target (ROCM_PATH/ROCM_ARCH, gfx1151) delegating to the script; windows-cpu unchanged. - win/README.md: native Windows ROCm build instructions, rocWMMA vendoring, import-lib rationale, and run caveats (PATH, DS4_LOCK_FILE, DS4_CUDA_MANAGED). Runtime: ds4-bench.exe initializes the backend and computes context buffers; full inference is not yet verified (needs the ~80 GB model and a BIOS UMA re-split + reboot). Success here = clean gfx1151 compile + link + startup. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The Windows CRT default struct stat / stat() / fstat() carry a 32-bit st_size, so stat'ing a file larger than 2GB fails with EOVERFLOW ("value too large") — fatal for the ~80GB DeepSeek V4 GGUF, which the loader fstat()s before mmap. Remap the bare names to the 64-bit _stat64 family (which names both the struct and the functions, with an __int64 st_size). <sys/stat.h> is pulled in first so its real declarations are parsed before the macros exist; the include guard makes ds4.c's and ds4_cuda.cu's later includes (opposite order) no-ops that still see the remap. Verified on both the hipcc/MSVC-ABI GPU build and the MinGW CPU build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ction math Port of the platform-independent correctness fixes from antirez#311 (9565c07) onto the native-Windows-ROCm base, plus the latent indexer bug found during the Windows port. (B) Indexer WMMA correctness bug (affects ALL ROCm builds): indexer_scores_wmma_kernel's body is guarded by `#if __CUDA_ARCH__ >= 700`, which hipcc never defines, so on HIP it compiles to an empty kernel - yet indexer_scores_launch() still LAUNCHED it on the head_dim==128, n_head==64, !quality_mode path, leaving `scores` uninitialised (wrong results). The upstream fix (9565c07) re-enables the WMMA body on HIP via rocWMMA, but the full rocWMMA matrix API is not available in this Windows HIP SDK build (only rocwmma-version.hpp is vendored). Instead, guard the WMMA launch with `#ifndef __HIP_PLATFORM_AMD__` so HIP falls through to the scalar indexer_scores_kernel, which fully initialises `scores`. CUDA keeps WMMA. (A) Precise expert-selection math (from 9565c07): route softplus_dev and the three router_select_* sqrt(softplus(.)) score computations through ds4_precise_{expf,log1pf,sqrtf}. On ROCm these bind to the OCML entry points (__ocml_{exp,log1p,sqrt}_f32) so -fapprox-func/fast-math cannot substitute the lower-precision approximations and flip expert selection. On CUDA they are thin wrappers over expf/log1pf/sqrtf - byte-identical behaviour, no CUDA/Metal/CPU regression. Verified: `make windows-rocm` rebuilds ds4-bench.exe clean for gfx1151; `make windows-cpu` still builds. Runtime ./ds4_test verification is pending the BIOS UMA re-split + reboot. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@2k

Port matmul_f16_pair_warp_kernel from upstream PR antirez#311 (9565c07) onto the clean base. One warp (32 lanes) computes one output row across both weight matrices of the paired F16 matmul, with ROWS_PER_BLOCK warps per block. Selected in ds4_gpu_matmul_f16_pair_tensor for the gen path (n_tok==1), gated by DS4_ROCM_NO_F16_PAIR_WARP_MATMUL (revert to ordered_chunks). Mathematically identical to the existing ordered_chunks kernel (same row-major dot product); the win is coalesced global reads (warp-stride vs per-thread contiguous chunk). HIP-only (#ifdef __HIP_PLATFORM_AMD__), so the CUDA path is byte-for-byte unchanged. Measured on Strix Halo gfx1151 (Radeon 8060S), q2-imatrix 80.76 GB, thermal-controlled A/B: gen 8.09->10.53 t/s @2k, 7.95->10.52 t/s @4k (+30%). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

kmc6042 and others added 7 commits June 6, 2026 17:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

win: native Windows ROCm/HIP GPU build for gfx1151 (Strix Halo) + correctness (fixes #348)#352

win: native Windows ROCm/HIP GPU build for gfx1151 (Strix Halo) + correctness (fixes #348)#352
jamesburton wants to merge 7 commits into
antirez:rocmfrom
jamesburton:gfx1151-correctness

jamesburton commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jamesburton commented Jun 7, 2026

Summary

Commits (over rocm)

Details

Benchmarks (Radeon 8060S, gfx1151, q2-imatrix 80.76 GiB, 96 GB VRAM split)

Test environment

Dependencies

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Commits (over `rocm`)