Blackwell GGML-CUDA SOFT_MAX Crash Fix
Author: Not a programmer — just a security guy who asked DeepSeek-v4 to
write the patches.
Date: 2026-06-26
Target: llama.cpp ggml-cuda on Blackwell RTX 5090 (SM 12.0), CUDA 13.3,
Driver 580.17
What Went Wrong
Running large models (35B MoE) on an RTX 5090 caused an immediate crash:
ggml_cuda_compute_forward: SOFT_MAX failed
Under the hood: cudaErrorInvalidValue on cudaLaunchKernel for the softmax
kernel. The driver said "Requested 1152 bytes of shared memory, but max is
0 bytes."
Root Cause
Two separate bugs:
Bug 1: The driver handed us a nonsense number
At startup, llama.cpp reads sharedMemPerBlockOptin from the GPU to know how
much shared memory it can request. On this RTX 5090 with CUDA 13.3, that value
came back as 4294967297 (= 0x100000001) instead of the correct 101376.
Why? Unknown. Could be a CUDA 13.3 driver bug on Blackwell, or a struct layout
mismatch in the old CUDA headers the code was compiled with. Either way, the
software then fed that garbage number to cudaFuncSetAttribute, which
silently rejected it and left the limit at 0 — meaning any kernel asking for
shared memory was dead on arrival.
Fix: Validate the value before using it. If it's below the guaranteed
default or above 256 KB (no GPU has ever had more), use the safe default
instead. This is generic across all architectures, not a Blackwell hack.
Bug 2: The cache missed every call after the first
The code used a macro CUDA_SET_SHARED_MEMORY_LIMIT that only called
cudaFuncSetAttribute once per GPU, then cached the result. But softmax
kernels are template instantiations — the same line of source code produces
multiple actual kernel functions at compile time (soft_max_f32<true, 256, 256>, soft_max_f32<true, 512, 256>, etc.). The cache only saved the flag
for the first one it saw, leaving every other instantiation with no shared
memory limit set.
Fix: Track kernel function pointers in an array instead of a boolean.
Each unique kernel address gets its own entry. No more missed calls.
Files Changed
| File |
Lines |
What |
ggml/src/ggml-cuda/common.cuh |
+15/-4 |
Fixed CUDA_SET_SHARED_MEMORY_LIMIT to track per-kernel (not per-device) |
ggml/src/ggml-cuda/ggml-cuda.cu |
+11/-1 |
Added bounds validation on sharedMemPerBlockOptin |
ggml/src/ggml-cuda/softmax.cu |
+0/-1 |
Removed blank line (debug artifact) |
Total: 21 insertions, 6 deletions
Verified On
- GPU: NVIDIA GeForce RTX 5090 (32 GB VRAM)
- Model: Qwen3.6-35B-A3B Q4_K_XL
- Benchmark: pp128=4231 t/s, tg256=262 t/s
- Errors: Zero
Patch
See attached ggml-cuda-blackwell-fix.patch.
This fix was written entirely by an AI agent at my direction. I cannot audit
CUDA code. If you are smarter than me (likely), please review and upstream a
better version.
First Bad Commit
No response
Relevant log output
0.03.207.415 E /mnt/storage/Projects/llama-cpp-exp/ggml/src/ggml-cuda/ggml-cuda.cu:104: CUDA error
ggml_cuda_compute_forward: SOFT_MAX failed
0.03.207.421 E CUDA error: invalid argument
0.03.207.423 E current device: 0, in function ggml_cuda_compute_forward at /mnt/storage/Projects/llama-cpp-exp/ggml/src/ggml-cuda/ggml-cuda.cu:3128
0.03.207.423 E err
ggml-cuda-blackwell-fix.patch
Blackwell GGML-CUDA SOFT_MAX Crash Fix
Author: Not a programmer — just a security guy who asked DeepSeek-v4 to
write the patches.
Date: 2026-06-26
Target: llama.cpp ggml-cuda on Blackwell RTX 5090 (SM 12.0), CUDA 13.3,
Driver 580.17
What Went Wrong
Running large models (35B MoE) on an RTX 5090 caused an immediate crash:
Under the hood:
cudaErrorInvalidValueoncudaLaunchKernelfor the softmaxkernel. The driver said "Requested 1152 bytes of shared memory, but max is
0 bytes."
Root Cause
Two separate bugs:
Bug 1: The driver handed us a nonsense number
At startup, llama.cpp reads
sharedMemPerBlockOptinfrom the GPU to know howmuch shared memory it can request. On this RTX 5090 with CUDA 13.3, that value
came back as 4294967297 (= 0x100000001) instead of the correct 101376.
Why? Unknown. Could be a CUDA 13.3 driver bug on Blackwell, or a struct layout
mismatch in the old CUDA headers the code was compiled with. Either way, the
software then fed that garbage number to
cudaFuncSetAttribute, whichsilently rejected it and left the limit at 0 — meaning any kernel asking for
shared memory was dead on arrival.
Fix: Validate the value before using it. If it's below the guaranteed
default or above 256 KB (no GPU has ever had more), use the safe default
instead. This is generic across all architectures, not a Blackwell hack.
Bug 2: The cache missed every call after the first
The code used a macro
CUDA_SET_SHARED_MEMORY_LIMITthat only calledcudaFuncSetAttributeonce per GPU, then cached the result. But softmaxkernels are template instantiations — the same line of source code produces
multiple actual kernel functions at compile time (
soft_max_f32<true, 256, 256>,soft_max_f32<true, 512, 256>, etc.). The cache only saved the flagfor the first one it saw, leaving every other instantiation with no shared
memory limit set.
Fix: Track kernel function pointers in an array instead of a boolean.
Each unique kernel address gets its own entry. No more missed calls.
Files Changed
ggml/src/ggml-cuda/common.cuhCUDA_SET_SHARED_MEMORY_LIMITto track per-kernel (not per-device)ggml/src/ggml-cuda/ggml-cuda.cusharedMemPerBlockOptinggml/src/ggml-cuda/softmax.cuTotal: 21 insertions, 6 deletions
Verified On
Patch
See attached
ggml-cuda-blackwell-fix.patch.This fix was written entirely by an AI agent at my direction. I cannot audit
CUDA code. If you are smarter than me (likely), please review and upstream a
better version.
First Bad Commit
No response
Relevant log output
0.03.207.415 E /mnt/storage/Projects/llama-cpp-exp/ggml/src/ggml-cuda/ggml-cuda.cu:104: CUDA error
ggml_cuda_compute_forward: SOFT_MAX failed
0.03.207.421 E CUDA error: invalid argument
0.03.207.423 E current device: 0, in function ggml_cuda_compute_forward at /mnt/storage/Projects/llama-cpp-exp/ggml/src/ggml-cuda/ggml-cuda.cu:3128
0.03.207.423 E err
ggml-cuda-blackwell-fix.patch