Misc. bug: Blackwell GGML-CUDA SOFT_MAX Crash

# Blackwell GGML-CUDA SOFT_MAX Crash Fix

**Author:** Not a programmer — just a security guy who asked DeepSeek-v4 to
write the patches.
**Date:** 2026-06-26
**Target:** llama.cpp ggml-cuda on Blackwell RTX 5090 (SM 12.0), CUDA 13.3,
Driver 580.17

---

## What Went Wrong

Running large models (35B MoE) on an RTX 5090 caused an immediate crash:

```
ggml_cuda_compute_forward: SOFT_MAX failed
```

Under the hood: `cudaErrorInvalidValue` on `cudaLaunchKernel` for the softmax
kernel. The driver said **"Requested 1152 bytes of shared memory, but max is
0 bytes."**

## Root Cause

Two separate bugs:

### Bug 1: The driver handed us a nonsense number

At startup, llama.cpp reads `sharedMemPerBlockOptin` from the GPU to know how
much shared memory it can request. On this RTX 5090 with CUDA 13.3, that value
came back as **4294967297** (= 0x100000001) instead of the correct **101376**.

Why? Unknown. Could be a CUDA 13.3 driver bug on Blackwell, or a struct layout
mismatch in the old CUDA headers the code was compiled with. Either way, the
software then fed that garbage number to `cudaFuncSetAttribute`, which
silently rejected it and left the limit at 0 — meaning any kernel asking for
shared memory was dead on arrival.
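The bit pattern of the bad value can be checked at compile time. The sketch below only decomposes the number quoted above; the helper names `high_half` and `low_half` are made up for illustration, and the result does not by itself confirm either the driver-bug or the header-mismatch explanation.

```cpp
#include <cstdint>

// Values quoted in this report.
constexpr std::uint64_t kReportedOptin = 4294967297ULL; // 0x100000001
constexpr std::uint64_t kExpectedOptin = 101376ULL;     // what the RTX 5090 should report

// Hypothetical helpers: split a 64-bit value into its 32-bit halves.
constexpr std::uint32_t high_half(std::uint64_t v) {
    return static_cast<std::uint32_t>(v >> 32);
}
constexpr std::uint32_t low_half(std::uint64_t v) {
    return static_cast<std::uint32_t>(v & 0xFFFFFFFFULL);
}

// The reported value is exactly 2^32 + 1: both 32-bit halves equal 1.
static_assert(kReportedOptin == (1ULL << 32) + 1, "reported value is 2^32 + 1");
static_assert(high_half(kReportedOptin) == 1u, "upper 32 bits are 1");
static_assert(low_half(kReportedOptin) == 1u, "lower 32 bits are 1");

// The expected value is 99 KiB.
static_assert(kExpectedOptin == 99ULL * 1024ULL, "expected value is 99 KiB");
```

A value whose two 32-bit halves are both 1 looks more like two small fields read as one 64-bit number than like a real byte count, which is why a struct layout mismatch is one plausible suspect.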

**Fix:** Validate the value before using it. If it is below the guaranteed
default or above 256 KB (more shared memory per block than any current NVIDIA
GPU exposes), fall back to the safe default instead. The check is generic
across all architectures, not a Blackwell hack.
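A minimal sketch of that bounds check, written as plain C++ so it can be compiled without a GPU. The function name `sanitize_smem_optin` and the constant `kMaxPlausibleSmem` are hypothetical and are not taken from the attached patch. It also assumes the "guaranteed default" is the device's ordinary per-block limit (the `sharedMemPerBlock` field of `cudaDeviceProp`), which the caller passes in.

```cpp
#include <cstddef>

// Assumed upper bound from the description above: 256 KB.
constexpr std::size_t kMaxPlausibleSmem = 256 * 1024;

// Hypothetical helper: return a usable opt-in shared memory limit.
// `reported`      is the value read from sharedMemPerBlockOptin.
// `default_limit` is the ordinary per-block limit (assumed: sharedMemPerBlock).
std::size_t sanitize_smem_optin(std::size_t reported, std::size_t default_limit) {
    // An opt-in limit below the default, or above any plausible hardware
    // value, is treated as garbage and replaced by the default.
    if (reported < default_limit || reported > kMaxPlausibleSmem) {
        return default_limit;
    }
    return reported;
}
```

With a 48 KB default, the bad value from this report falls back to 49152, while the expected 101376 passes through unchanged. Falling back to the default costs performance headroom on an affected system but keeps kernels launchable.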

### Bug 2: The cache missed every call after the first

The code used a macro `CUDA_SET_SHARED_MEMORY_LIMIT` that only called
`cudaFuncSetAttribute` once per GPU, then cached the result. But softmax
kernels are template instantiations — the same line of source code produces
multiple actual kernel functions at compile time (`soft_max_f32<true, 256,
256>`, `soft_max_f32<true, 512, 256>`, etc.). The cache only saved the flag
for the *first* one it saw, leaving every other instantiation with no shared
memory limit set.

**Fix:** Track kernel function pointers in an array instead of a boolean.
Each unique kernel address gets its own entry. No more missed calls.
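The per-kernel tracking idea can be sketched in plain C++. Everything here is illustrative: `SmemLimitCache`, `kMaxTracked`, and `soft_max_stub` are hypothetical names, the stubs stand in for the real `soft_max_f32` instantiations, and the per-device dimension of the real macro is left out for brevity. In the real code, a `true` return would be the point where `cudaFuncSetAttribute` is called for that kernel.

```cpp
// Generic function pointer type used as the cache key.
using AnyFn = void (*)();

constexpr int kMaxTracked = 64; // hypothetical capacity

struct SmemLimitCache {
    AnyFn kernels[kMaxTracked] = {};
    int count = 0;

    // Returns true if this kernel has not been configured yet, and records it.
    // If the table is full, the kernel is not recorded and true is returned
    // every time: the caller re-applies the limit, which is redundant but safe.
    bool needs_setup(AnyFn kernel) {
        for (int i = 0; i < count; ++i) {
            if (kernels[i] == kernel) {
                return false; // already configured
            }
        }
        if (count < kMaxTracked) {
            kernels[count++] = kernel;
        }
        return true;
    }
};

// Stand-in for a templated kernel: each instantiation is a distinct function
// with its own address, just like the softmax kernel instantiations.
template <bool use_shared, int ncols, int block_size>
int soft_max_stub() {
    return ncols + block_size;
}
```

A single boolean flag would report "already configured" for the second instantiation even though only the first was set up; keying on the function address avoids that.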

## Files Changed

| File | Lines | What |
|------|-------|------|
| `ggml/src/ggml-cuda/common.cuh` | +15/-4 | Fixed `CUDA_SET_SHARED_MEMORY_LIMIT` to track per-kernel (not per-device) |
| `ggml/src/ggml-cuda/ggml-cuda.cu` | +11/-1 | Added bounds validation on `sharedMemPerBlockOptin` |
| `ggml/src/ggml-cuda/softmax.cu` | +0/-1 | Removed blank line (debug artifact) |

**Total: 21 insertions, 6 deletions**

## Verified On

- **GPU:** NVIDIA GeForce RTX 5090 (32 GB VRAM)
- **Model:** Qwen3.6-35B-A3B Q4_K_XL
- **Benchmark:** pp128=4231 t/s, tg256=262 t/s
- **Errors:** Zero

## Patch

See attached `ggml-cuda-blackwell-fix.patch`.

---

*This fix was written entirely by an AI agent at my direction. I cannot audit
CUDA code. If you are smarter than me (likely), please review and upstream a
better version.*

### First Bad Commit

_No response_

### Relevant log output

    0.03.207.415 E /mnt/storage/Projects/llama-cpp-exp/ggml/src/ggml-cuda/ggml-cuda.cu:104: CUDA error
    ggml_cuda_compute_forward: SOFT_MAX failed
    0.03.207.421 E CUDA error: invalid argument
    0.03.207.423 E   current device: 0, in function ggml_cuda_compute_forward at /mnt/storage/Projects/llama-cpp-exp/ggml/src/ggml-cuda/ggml-cuda.cu:3128
    0.03.207.423 E   err

[ggml-cuda-blackwell-fix.patch](https://github.com/user-attachments/files/29391973/ggml-cuda-blackwell-fix.patch)
