Skip to content

Misc. bug: Blackwell GGML-CUDA SOFT_MAX Crash #25060

Description

@giveen

Blackwell GGML-CUDA SOFT_MAX Crash Fix

Author: Not a programmer — just a security guy who asked DeepSeek-v4 to
write the patches.
Date: 2026-06-26
Target: llama.cpp ggml-cuda on Blackwell RTX 5090 (SM 12.0), CUDA 13.3,
Driver 580.17


What Went Wrong

Running large models (35B MoE) on an RTX 5090 caused an immediate crash:

ggml_cuda_compute_forward: SOFT_MAX failed

Under the hood: cudaErrorInvalidValue on cudaLaunchKernel for the softmax
kernel. The driver said "Requested 1152 bytes of shared memory, but max is
0 bytes."

Root Cause

Two separate bugs:

Bug 1: The driver handed us a nonsense number

At startup, llama.cpp reads sharedMemPerBlockOptin from the GPU to know how
much shared memory it can request. On this RTX 5090 with CUDA 13.3, that value
came back as 4294967297 (= 0x100000001) instead of the correct 101376.

Why? Unknown. Could be a CUDA 13.3 driver bug on Blackwell, or a struct layout
mismatch in the old CUDA headers the code was compiled with. Either way, the
software then fed that garbage number to cudaFuncSetAttribute, which
silently rejected it and left the limit at 0 — meaning any kernel asking for
shared memory was dead on arrival.

Fix: Validate the value before using it. If it's below the guaranteed
default or above 256 KB (no GPU has ever had more), use the safe default
instead. This is generic across all architectures, not a Blackwell hack.

Bug 2: The cache missed every call after the first

The code used a macro CUDA_SET_SHARED_MEMORY_LIMIT that only called
cudaFuncSetAttribute once per GPU, then cached the result. But softmax
kernels are template instantiations — the same line of source code produces
multiple actual kernel functions at compile time (soft_max_f32<true, 256, 256>, soft_max_f32<true, 512, 256>, etc.). The cache only saved the flag
for the first one it saw, leaving every other instantiation with no shared
memory limit set.

Fix: Track kernel function pointers in an array instead of a boolean.
Each unique kernel address gets its own entry. No more missed calls.

Files Changed

File Lines What
ggml/src/ggml-cuda/common.cuh +15/-4 Fixed CUDA_SET_SHARED_MEMORY_LIMIT to track per-kernel (not per-device)
ggml/src/ggml-cuda/ggml-cuda.cu +11/-1 Added bounds validation on sharedMemPerBlockOptin
ggml/src/ggml-cuda/softmax.cu +0/-1 Removed blank line (debug artifact)

Total: 21 insertions, 6 deletions

Verified On

  • GPU: NVIDIA GeForce RTX 5090 (32 GB VRAM)
  • Model: Qwen3.6-35B-A3B Q4_K_XL
  • Benchmark: pp128=4231 t/s, tg256=262 t/s
  • Errors: Zero

Patch

See attached ggml-cuda-blackwell-fix.patch.


This fix was written entirely by an AI agent at my direction. I cannot audit
CUDA code. If you are smarter than me (likely), please review and upstream a
better version.

First Bad Commit

No response

Relevant log output

0.03.207.415 E /mnt/storage/Projects/llama-cpp-exp/ggml/src/ggml-cuda/ggml-cuda.cu:104: CUDA error
ggml_cuda_compute_forward: SOFT_MAX failed
0.03.207.421 E CUDA error: invalid argument
0.03.207.423 E current device: 0, in function ggml_cuda_compute_forward at /mnt/storage/Projects/llama-cpp-exp/ggml/src/ggml-cuda/ggml-cuda.cu:3128
0.03.207.423 E err

ggml-cuda-blackwell-fix.patch

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions