Skip to content

win: native Windows ROCm/HIP GPU build for gfx1151 (Strix Halo) + correctness (fixes #348)#352

Draft
jamesburton wants to merge 7 commits into
antirez:rocmfrom
jamesburton:gfx1151-correctness
Draft

win: native Windows ROCm/HIP GPU build for gfx1151 (Strix Halo) + correctness (fixes #348)#352
jamesburton wants to merge 7 commits into
antirez:rocmfrom
jamesburton:gfx1151-correctness

Conversation

@jamesburton
Copy link
Copy Markdown

Summary

Adds a native Windows ROCm/HIP GPU build of ds4 for gfx1151 (AMD Strix Halo / Ryzen AI MAX+ 395, Radeon 8060S iGPU) — no WSL — and fixes a ROCm correctness bug along the way. Built and run on real silicon with the AMD HIP SDK for Windows (ROCm 7.1).

This continues the native-Windows series started in #344 (CPU build) and fixes #348 (uninitialized indexer scores on HIP).

⚠️ Draft / stacked PR. This branch builds on three still-open PRs, so their commits appear here too — see Dependencies. It's verified on the Windows HIP SDK path so far (not yet Linux ROCm). Happy to rebase / split however you prefer.

Commits (over rocm)

commit what note
e217b65 DS4_CUDA_MANAGED for full UMA pool = #313 (dependency)
9953dc0 chunked model copy without OOM = #320 (dependency)
06b282c native Windows MinGW CPU build (POSIX shim) = #344 (dependency)
3cb9e28 native Windows ROCm/HIP GPU build for gfx1151 ⭐ this PR
3b561b7 64-bit file stat so >2 GB models load (fix EOVERFLOW) ⭐ this PR
4faede4 indexer scalar fallback when rocWMMA unavailable fixes #348
8795f3e warp pair-matmul kernel (+30% gen) overlaps #311 — easy to drop

Details

Native Windows ROCm build (3cb9e28). Compiles ds4_cuda.cu with hipcc/clang from the AMD ROCm 7.1 Windows SDK (MSVC ABI) via a dependency-free POSIX shim — no WSL, no /opt/rocm. The Windows HIP SDK ships only the rocWMMA version header (no rocwmma/ tree), which motivates the next two fixes.

64-bit file stat (3b561b7). The CRT default struct stat/fstat are 32-bit → EOVERFLOW ("value too large") on the ~80 GB q2 model. Remaps stat/fstat_stat64 family. Both the HIP (hipcc/MSVC) and MinGW CPU builds verified.

Indexer scalar fallback — fixes #348 (4faede4). On HIP, indexer_scores_wmma_kernel was launched unconditionally but its body is gated by #if __CUDA_ARCH__ >= 700, which hipcc never defines → the kernel compiles to an empty body and leaves scores uninitialized on the head_dim==128, n_head==64, !quality path (silent, nondeterministic wrong output). This guards the launch so HIP without full rocWMMA falls through to the scalar indexer_scores_kernel. (#311 instead enables the WMMA body on HIP via <rocwmma/rocwmma.hpp>, which can't compile on the Windows HIP SDK — see #348 for the full analysis.) Correctness confirmed via the deterministic MoE check (DS4_CUDA_MOE_NO_ATOMIC_DOWN=1top1_mismatch=0).

Warp pair-matmul (8795f3e). Ports the coalesced warp-stride f16 pair-matmul (one warp per output row, reduced via warp_sum_f32); gen-only (n_tok==1), CUDA path byte-identical, default ON (revert DS4_ROCM_NO_F16_PAIR_WARP_MATMUL=1). This is #311's kernel — included here because we measured it, but trivially dropped if you'd rather keep it with #311.

Benchmarks (Radeon 8060S, gfx1151, q2-imatrix 80.76 GiB, 96 GB VRAM split)

Full 38-run methodology + data: https://github.com/jamesburton/ds4-rocm/tree/main/results

Test environment

  • GPU: AMD Radeon 8060S (gfx1151, Strix Halo), 128 GB LPDDR5X-8533, BIOS 96 GB iGPU VRAM split.
  • Toolchain: AMD HIP SDK for Windows, ROCm 7.1; MSVC ABI; MinGW GCC 15.2 for the CPU build.
  • Runtime env: DS4_CUDA_COPY_MODEL_CHUNKED=1, DS4_LOCK_FILE=<win path>.

Dependencies

Built on #344 (CPU build), #313 (DS4_CUDA_MANAGED), #320 (chunked copy) — all currently open. Their commits are included in this branch; this PR is cleanest reviewed/merged after them. Fixes #348.

kmc6042 and others added 7 commits June 6, 2026 17:55
Add DS4_CUDA_MANAGED=1 environment variable that switches
ds4_gpu_tensor_alloc from cudaMalloc (VRAM carve-out only) to
cudaMallocManaged (full unified memory pool).

This is critical for UMA platforms like AMD Strix Halo where the
BIOS VRAM carve-out (e.g. 96 GB) is smaller than physical memory
(128 GB). Without this, context buffers are limited to the BIOS
carve-out, capping usable context at ~870K when the model weights
occupy ~81 GB. With managed memory, the full 128 GB pool is
available, enabling 1M-token context.

The change is opt-in and zero-overhead when the env var is unset.

Also fix two rsqrtf() calls to 1.0f/sqrtf() for ROCm
compatibility.

Tested on Strix Halo (Ryzen AI MAX+ 395, gfx1151):
  DS4_CUDA_MANAGED=1 ./ds4-server --ctx 1000000
  → context buffers 17222.50 MiB, server starts successfully
  → 1M context chat completion: 8.71 t/s, correct output

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When DS4_CUDA_COPY_MODEL_CHUNKED is set, skip cudaHostRegister.
Registering a large memory map prevents posix_madvise(DONTNEED)
from freeing pages during the chunked copy, leading to catastrophic
system RAM exhaustion on APUs with unified memory.
Adds native Windows CPU build support for ds4-bench using MinGW-w64 GCC
(no WSL, Cygwin or MSVC). A small header-only POSIX shim, ds4_win.h,
supplies the surface MinGW/UCRT lacks: mmap/munmap/madvise, sysconf,
flock/fcntl/pread/dprintf, and a temp-file-backed fmemopen. MinGW already
provides pthread, clock_gettime and ftruncate.

The shim is wired in-tree behind #ifdef _WIN32: ds4.c includes ds4_win.h
in place of <sys/mman.h> on Windows. The header's whole body is guarded by
_WIN32, so POSIX (Linux/macOS/CUDA/ROCm) builds are unchanged.

Makefile detects MinGW/MSYS (uname -s = MINGW*/MSYS*), defaults CC to gcc
there, and adds a `windows-cpu` target that builds ds4-bench.exe with
-DDS4_NO_GPU. Darwin/Linux/cuda/rocm paths are untouched. win/README.md
documents the build (make + direct-gcc fallback), the DS4_LOCK_FILE runtime
note, and the deferred CLI (termios/sigaction) and server (Winsock) ports.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Build DS4's GPU backend natively on Windows with the AMD HIP SDK — no
WSL, no MSVC on PATH, no full Visual Studio needed at the command line
(hipcc auto-discovers the VS build tools' headers). Produces a gfx1151
ROCm ds4-bench.exe that links the HIP runtime + hipBLAS and starts up,
selecting the GPU backend before model load.

Toolchain findings (Windows HIP SDK 7.1):
- hipcc.exe compiles, links, and runs HIP + hipBLAS for gfx1151 using
  its bundled clang (target x86_64-pc-windows-msvc); no cl.exe on PATH.
- The host ABI is therefore MSVC, so the C host files are compiled with
  clang --target=...-windows-msvc to match the hipcc-built ds4_cuda.o
  (mixing MinGW and MSVC C runtimes across the FILE*/heap boundary is
  unsafe). The MinGW CPU build is left exactly as-is.

Changes (all GPU-build portability behind #ifdef _WIN32, sub-guarded by
__MINGW32__ / DS4_WIN_PTHREAD, so POSIX/macOS/CUDA/Linux-ROCm builds are
byte-for-byte unchanged):
- Vendor the rocWMMA version header (MIT, header-only) the Windows SDK
  omits; ds4_rocm.h needs only this (DS4's WMMA path is CUDA-only).
- ds4_cuda.cu: use the ds4_win.h POSIX shim + <io.h> instead of
  <unistd.h>; skip the Linux-only st_blksize O_DIRECT hint.
- ds4_win.h: add STDERR_FILENO/ssize_t/SSIZE_MAX/off_t, clock_gettime +
  CLOCK_MONOTONIC, and ftruncate for the MSVC-ABI build (guarded
  !__MINGW32__ so the MinGW CPU build is untouched).
- win/ds4_pthread_win.h: header-only Win32 pthread shim (threads, mutex,
  condvar, once) used when DS4_WIN_PTHREAD is set (MSVC has no pthread.h).
- win/build-rocm.sh: build script that works around hipcc.exe's
  space-splitting .bat wrapper, and synthesizes the missing MSVC-style
  hipblas.lib from libhipblas.dll via llvm-dlltool (git-ignored, cached).
- Makefile: add a `windows-rocm` target (ROCM_PATH/ROCM_ARCH, gfx1151)
  delegating to the script; windows-cpu unchanged.
- win/README.md: native Windows ROCm build instructions, rocWMMA
  vendoring, import-lib rationale, and run caveats (PATH, DS4_LOCK_FILE,
  DS4_CUDA_MANAGED).

Runtime: ds4-bench.exe initializes the backend and computes context
buffers; full inference is not yet verified (needs the ~80 GB model and
a BIOS UMA re-split + reboot). Success here = clean gfx1151 compile +
link + startup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Windows CRT default struct stat / stat() / fstat() carry a 32-bit
st_size, so stat'ing a file larger than 2GB fails with EOVERFLOW
("value too large") — fatal for the ~80GB DeepSeek V4 GGUF, which the
loader fstat()s before mmap. Remap the bare names to the 64-bit _stat64
family (which names both the struct and the functions, with an __int64
st_size). <sys/stat.h> is pulled in first so its real declarations are
parsed before the macros exist; the include guard makes ds4.c's and
ds4_cuda.cu's later includes (opposite order) no-ops that still see the
remap. Verified on both the hipcc/MSVC-ABI GPU build and the MinGW CPU build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ction math

Port of the platform-independent correctness fixes from antirez#311 (9565c07)
onto the native-Windows-ROCm base, plus the latent indexer bug found
during the Windows port.

(B) Indexer WMMA correctness bug (affects ALL ROCm builds):
indexer_scores_wmma_kernel's body is guarded by `#if __CUDA_ARCH__ >= 700`,
which hipcc never defines, so on HIP it compiles to an empty kernel - yet
indexer_scores_launch() still LAUNCHED it on the head_dim==128, n_head==64,
!quality_mode path, leaving `scores` uninitialised (wrong results). The
upstream fix (9565c07) re-enables the WMMA body on HIP via rocWMMA, but the
full rocWMMA matrix API is not available in this Windows HIP SDK build (only
rocwmma-version.hpp is vendored). Instead, guard the WMMA launch with
`#ifndef __HIP_PLATFORM_AMD__` so HIP falls through to the scalar
indexer_scores_kernel, which fully initialises `scores`. CUDA keeps WMMA.

(A) Precise expert-selection math (from 9565c07): route softplus_dev and the
three router_select_* sqrt(softplus(.)) score computations through
ds4_precise_{expf,log1pf,sqrtf}. On ROCm these bind to the OCML entry points
(__ocml_{exp,log1p,sqrt}_f32) so -fapprox-func/fast-math cannot substitute the
lower-precision approximations and flip expert selection. On CUDA they are thin
wrappers over expf/log1pf/sqrtf - byte-identical behaviour, no CUDA/Metal/CPU
regression.

Verified: `make windows-rocm` rebuilds ds4-bench.exe clean for gfx1151;
`make windows-cpu` still builds. Runtime ./ds4_test verification is pending
the BIOS UMA re-split + reboot.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Port matmul_f16_pair_warp_kernel from upstream PR antirez#311 (9565c07) onto the
clean base. One warp (32 lanes) computes one output row across both weight
matrices of the paired F16 matmul, with ROWS_PER_BLOCK warps per block.
Selected in ds4_gpu_matmul_f16_pair_tensor for the gen path (n_tok==1),
gated by DS4_ROCM_NO_F16_PAIR_WARP_MATMUL (revert to ordered_chunks).

Mathematically identical to the existing ordered_chunks kernel (same
row-major dot product); the win is coalesced global reads (warp-stride
vs per-thread contiguous chunk). HIP-only (#ifdef __HIP_PLATFORM_AMD__),
so the CUDA path is byte-for-byte unchanged.

Measured on Strix Halo gfx1151 (Radeon 8060S), q2-imatrix 80.76 GB,
thermal-controlled A/B: gen 8.09->10.53 t/s @2k, 7.95->10.52 t/s @4k (+30%).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants