win: native Windows ROCm/HIP GPU build for gfx1151 (Strix Halo) + correctness (fixes #348)#352
Draft
jamesburton wants to merge 7 commits into
Draft
win: native Windows ROCm/HIP GPU build for gfx1151 (Strix Halo) + correctness (fixes #348)#352jamesburton wants to merge 7 commits into
jamesburton wants to merge 7 commits into
Conversation
Add DS4_CUDA_MANAGED=1 environment variable that switches ds4_gpu_tensor_alloc from cudaMalloc (VRAM carve-out only) to cudaMallocManaged (full unified memory pool). This is critical for UMA platforms like AMD Strix Halo where the BIOS VRAM carve-out (e.g. 96 GB) is smaller than physical memory (128 GB). Without this, context buffers are limited to the BIOS carve-out, capping usable context at ~870K when the model weights occupy ~81 GB. With managed memory, the full 128 GB pool is available, enabling 1M-token context. The change is opt-in and zero-overhead when the env var is unset. Also fix two rsqrtf() calls to 1.0f/sqrtf() for ROCm compatibility. Tested on Strix Halo (Ryzen AI MAX+ 395, gfx1151): DS4_CUDA_MANAGED=1 ./ds4-server --ctx 1000000 → context buffers 17222.50 MiB, server starts successfully → 1M context chat completion: 8.71 t/s, correct output Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When DS4_CUDA_COPY_MODEL_CHUNKED is set, skip cudaHostRegister. Registering a large memory map prevents posix_madvise(DONTNEED) from freeing pages during the chunked copy, leading to catastrophic system RAM exhaustion on APUs with unified memory.
Adds native Windows CPU build support for ds4-bench using MinGW-w64 GCC (no WSL, Cygwin or MSVC). A small header-only POSIX shim, ds4_win.h, supplies the surface MinGW/UCRT lacks: mmap/munmap/madvise, sysconf, flock/fcntl/pread/dprintf, and a temp-file-backed fmemopen. MinGW already provides pthread, clock_gettime and ftruncate. The shim is wired in-tree behind #ifdef _WIN32: ds4.c includes ds4_win.h in place of <sys/mman.h> on Windows. The header's whole body is guarded by _WIN32, so POSIX (Linux/macOS/CUDA/ROCm) builds are unchanged. Makefile detects MinGW/MSYS (uname -s = MINGW*/MSYS*), defaults CC to gcc there, and adds a `windows-cpu` target that builds ds4-bench.exe with -DDS4_NO_GPU. Darwin/Linux/cuda/rocm paths are untouched. win/README.md documents the build (make + direct-gcc fallback), the DS4_LOCK_FILE runtime note, and the deferred CLI (termios/sigaction) and server (Winsock) ports. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Build DS4's GPU backend natively on Windows with the AMD HIP SDK — no WSL, no MSVC on PATH, no full Visual Studio needed at the command line (hipcc auto-discovers the VS build tools' headers). Produces a gfx1151 ROCm ds4-bench.exe that links the HIP runtime + hipBLAS and starts up, selecting the GPU backend before model load. Toolchain findings (Windows HIP SDK 7.1): - hipcc.exe compiles, links, and runs HIP + hipBLAS for gfx1151 using its bundled clang (target x86_64-pc-windows-msvc); no cl.exe on PATH. - The host ABI is therefore MSVC, so the C host files are compiled with clang --target=...-windows-msvc to match the hipcc-built ds4_cuda.o (mixing MinGW and MSVC C runtimes across the FILE*/heap boundary is unsafe). The MinGW CPU build is left exactly as-is. Changes (all GPU-build portability behind #ifdef _WIN32, sub-guarded by __MINGW32__ / DS4_WIN_PTHREAD, so POSIX/macOS/CUDA/Linux-ROCm builds are byte-for-byte unchanged): - Vendor the rocWMMA version header (MIT, header-only) the Windows SDK omits; ds4_rocm.h needs only this (DS4's WMMA path is CUDA-only). - ds4_cuda.cu: use the ds4_win.h POSIX shim + <io.h> instead of <unistd.h>; skip the Linux-only st_blksize O_DIRECT hint. - ds4_win.h: add STDERR_FILENO/ssize_t/SSIZE_MAX/off_t, clock_gettime + CLOCK_MONOTONIC, and ftruncate for the MSVC-ABI build (guarded !__MINGW32__ so the MinGW CPU build is untouched). - win/ds4_pthread_win.h: header-only Win32 pthread shim (threads, mutex, condvar, once) used when DS4_WIN_PTHREAD is set (MSVC has no pthread.h). - win/build-rocm.sh: build script that works around hipcc.exe's space-splitting .bat wrapper, and synthesizes the missing MSVC-style hipblas.lib from libhipblas.dll via llvm-dlltool (git-ignored, cached). - Makefile: add a `windows-rocm` target (ROCM_PATH/ROCM_ARCH, gfx1151) delegating to the script; windows-cpu unchanged. - win/README.md: native Windows ROCm build instructions, rocWMMA vendoring, import-lib rationale, and run caveats (PATH, DS4_LOCK_FILE, DS4_CUDA_MANAGED). Runtime: ds4-bench.exe initializes the backend and computes context buffers; full inference is not yet verified (needs the ~80 GB model and a BIOS UMA re-split + reboot). Success here = clean gfx1151 compile + link + startup. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Windows CRT default struct stat / stat() / fstat() carry a 32-bit
st_size, so stat'ing a file larger than 2GB fails with EOVERFLOW
("value too large") — fatal for the ~80GB DeepSeek V4 GGUF, which the
loader fstat()s before mmap. Remap the bare names to the 64-bit _stat64
family (which names both the struct and the functions, with an __int64
st_size). <sys/stat.h> is pulled in first so its real declarations are
parsed before the macros exist; the include guard makes ds4.c's and
ds4_cuda.cu's later includes (opposite order) no-ops that still see the
remap. Verified on both the hipcc/MSVC-ABI GPU build and the MinGW CPU build.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ction math Port of the platform-independent correctness fixes from antirez#311 (9565c07) onto the native-Windows-ROCm base, plus the latent indexer bug found during the Windows port. (B) Indexer WMMA correctness bug (affects ALL ROCm builds): indexer_scores_wmma_kernel's body is guarded by `#if __CUDA_ARCH__ >= 700`, which hipcc never defines, so on HIP it compiles to an empty kernel - yet indexer_scores_launch() still LAUNCHED it on the head_dim==128, n_head==64, !quality_mode path, leaving `scores` uninitialised (wrong results). The upstream fix (9565c07) re-enables the WMMA body on HIP via rocWMMA, but the full rocWMMA matrix API is not available in this Windows HIP SDK build (only rocwmma-version.hpp is vendored). Instead, guard the WMMA launch with `#ifndef __HIP_PLATFORM_AMD__` so HIP falls through to the scalar indexer_scores_kernel, which fully initialises `scores`. CUDA keeps WMMA. (A) Precise expert-selection math (from 9565c07): route softplus_dev and the three router_select_* sqrt(softplus(.)) score computations through ds4_precise_{expf,log1pf,sqrtf}. On ROCm these bind to the OCML entry points (__ocml_{exp,log1p,sqrt}_f32) so -fapprox-func/fast-math cannot substitute the lower-precision approximations and flip expert selection. On CUDA they are thin wrappers over expf/log1pf/sqrtf - byte-identical behaviour, no CUDA/Metal/CPU regression. Verified: `make windows-rocm` rebuilds ds4-bench.exe clean for gfx1151; `make windows-cpu` still builds. Runtime ./ds4_test verification is pending the BIOS UMA re-split + reboot. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Port matmul_f16_pair_warp_kernel from upstream PR antirez#311 (9565c07) onto the clean base. One warp (32 lanes) computes one output row across both weight matrices of the paired F16 matmul, with ROWS_PER_BLOCK warps per block. Selected in ds4_gpu_matmul_f16_pair_tensor for the gen path (n_tok==1), gated by DS4_ROCM_NO_F16_PAIR_WARP_MATMUL (revert to ordered_chunks). Mathematically identical to the existing ordered_chunks kernel (same row-major dot product); the win is coalesced global reads (warp-stride vs per-thread contiguous chunk). HIP-only (#ifdef __HIP_PLATFORM_AMD__), so the CUDA path is byte-for-byte unchanged. Measured on Strix Halo gfx1151 (Radeon 8060S), q2-imatrix 80.76 GB, thermal-controlled A/B: gen 8.09->10.53 t/s @2k, 7.95->10.52 t/s @4k (+30%). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a native Windows ROCm/HIP GPU build of ds4 for gfx1151 (AMD Strix Halo / Ryzen AI MAX+ 395, Radeon 8060S iGPU) — no WSL — and fixes a ROCm correctness bug along the way. Built and run on real silicon with the AMD HIP SDK for Windows (ROCm 7.1).
This continues the native-Windows series started in #344 (CPU build) and fixes #348 (uninitialized indexer scores on HIP).
Commits (over
rocm)e217b65DS4_CUDA_MANAGEDfor full UMA pool9953dc006b282c3cb9e283b561b7EOVERFLOW)4faede48795f3eDetails
Native Windows ROCm build (
3cb9e28). Compilesds4_cuda.cuwithhipcc/clangfrom the AMD ROCm 7.1 Windows SDK (MSVC ABI) via a dependency-free POSIX shim — no WSL, no/opt/rocm. The Windows HIP SDK ships only the rocWMMA version header (norocwmma/tree), which motivates the next two fixes.64-bit file stat (
3b561b7). The CRT defaultstruct stat/fstatare 32-bit →EOVERFLOW("value too large") on the ~80 GB q2 model. Remapsstat/fstat→_stat64family. Both the HIP (hipcc/MSVC) and MinGW CPU builds verified.Indexer scalar fallback — fixes #348 (
4faede4). On HIP,indexer_scores_wmma_kernelwas launched unconditionally but its body is gated by#if __CUDA_ARCH__ >= 700, whichhipccnever defines → the kernel compiles to an empty body and leavesscoresuninitialized on thehead_dim==128, n_head==64, !qualitypath (silent, nondeterministic wrong output). This guards the launch so HIP without full rocWMMA falls through to the scalarindexer_scores_kernel. (#311 instead enables the WMMA body on HIP via<rocwmma/rocwmma.hpp>, which can't compile on the Windows HIP SDK — see #348 for the full analysis.) Correctness confirmed via the deterministic MoE check (DS4_CUDA_MOE_NO_ATOMIC_DOWN=1→top1_mismatch=0).Warp pair-matmul (
8795f3e). Ports the coalesced warp-stride f16 pair-matmul (one warp per output row, reduced viawarp_sum_f32); gen-only (n_tok==1), CUDA path byte-identical, default ON (revertDS4_ROCM_NO_F16_PAIR_WARP_MATMUL=1). This is #311's kernel — included here because we measured it, but trivially dropped if you'd rather keep it with #311.Benchmarks (Radeon 8060S, gfx1151, q2-imatrix 80.76 GiB, 96 GB VRAM split)
-t 4. In line with the gfx1151 figures in rocm: fix gfx1151 correctness (precise expert-selection math), enable indexer, enable agent #311 (~11.79) / cuda: add DS4_CUDA_MANAGED env var for full UMA pool access on Strix Halo #313 (~8.71 @ 1M ctx).Full 38-run methodology + data: https://github.com/jamesburton/ds4-rocm/tree/main/results
Test environment
DS4_CUDA_COPY_MODEL_CHUNKED=1,DS4_LOCK_FILE=<win path>.Dependencies
Built on #344 (CPU build), #313 (
DS4_CUDA_MANAGED), #320 (chunked copy) — all currently open. Their commits are included in this branch; this PR is cleanest reviewed/merged after them. Fixes #348.