Name and Version
version: 9775 (be4a6a6)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
2 x RTX 3060
Models
Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
Problem description & steps to reproduce
Bug Description
The llama.cpp server crashes with a CUDA error when processing a prompt with a very large context. The error occurs specifically in cublasSgemm_v2 during matrix multiplication.
Error Message
First Bad Commit
Stack Trace
#0 __GI___wait4
#1 ggml_print_backtrace
#2 ggml_abort
#3 ggml_cuda_error
#4 ggml_cuda_op_mul_mat_cublas
#5 ggml_cuda_op_mul_mat
#6 ggml_backend_cuda_graph_compute
#7 ggml_backend_sched_graph_compute_async
#8 llama_context::graph_compute
#9 llama_context::process_ubatch
#10 llama_context::decode
#11 llama_decode
#12 server_context_impl::decode
#13 server_context_impl::update_slots
#14 server_queue::start_loop
#15 llama_server
#16 __libc_start_call_main
#17 __libc_start_main_impl
#18 _start
Relevant log output
## Context Information from Logs
- Slot ID: 1
- Task ID: 19910
- Context slot size (n_ctx_slot): 131,072 tokens
- Prompt tokens (task.n_tokens): 105,375 tokens
- Cached tokens: 105,371 (after incremental caching)
- Context checkpoint created: 32 of 32 checkpoints
- Checkpoint size: 269.744 MiB
- Position: 104,923 tokens
Environment
- OS: Linux (Ubuntu/Debian based on paths)
- llama.cpp path: /opt/llama.cpp-beta/
- CUDA device: 0
- Build type: Beta branch (llama.cpp-beta)
Steps to Reproduce
- Start llama.cpp server with a model supporting large context (n_ctx = 131072)
- Send a prompt with approximately 105,375 tokens
- The server processes the prompt, creates context checkpoints, caches tokens
- During the decode phase, the CUDA error occurs in cublasSgemm_v2
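The steps above could be scripted roughly as follows. This is only a sketch under assumptions: it targets the server's `/completion` endpoint on the port from the command line in this report, and it pads the prompt with a repeated filler word, so the resulting token count is approximate and depends on the tokenizer. The helper names `build_payload` and `send` are hypothetical, and a synthetic prompt may not hit the same code path as the original (cached, multi-slot) request.

```python
import json
import urllib.request

# Port taken from the --port flag in this report; the host is an assumption.
SERVER_URL = "http://localhost:11434/completion"


def build_payload(n_words: int, n_predict: int = 16) -> bytes:
    """Build a JSON request body whose prompt is n_words filler words.

    The word-to-token ratio depends on the tokenizer, so n_words is only
    a rough proxy for the ~105k prompt tokens mentioned in the report.
    """
    prompt = " ".join(["hello"] * n_words)
    return json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")


def send(payload: bytes, url: str = SERVER_URL, timeout: float = 3600.0) -> dict:
    """POST the payload to the server and return the decoded JSON reply."""
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())
```

With the server running, `send(build_payload(100_000))` would submit one long prompt; adjust `n_words` until the server log reports a prompt size close to the one above.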
Additional Notes
- The error happens after successfully caching most of the prompt (105,371 out of 105,375 tokens)
- Context checkpoints were being created successfully (32 checkpoints)
- The error specifically mentions "an unsupported value or parameter was passed to the function" in cuBLAS
- This suggests a potential integer overflow or dimension mismatch when passing matrix dimensions to cublasSgemm_v2
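To make the overflow hypothesis concrete: the non-64-bit cublasSgemm entry point takes its sizes and leading dimensions as 32-bit `int`, and, as far as I understand the cuBLAS documentation, it reports an invalid-value status when a size is negative or a leading dimension is smaller than the corresponding row count. The sketch below mimics that check in Python. It is not llama.cpp or cuBLAS code, the function names are hypothetical, and the example values are illustrative rather than taken from the crash.

```python
def to_int32(x: int) -> int:
    """Wrap a Python int the way a cast to a signed 32-bit int would."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x


def sgemm_args_valid(transa: bool, transb: bool,
                     m: int, n: int, k: int,
                     lda: int, ldb: int, ldc: int) -> bool:
    """Approximate the argument checks documented for cublasSgemm.

    All sizes are first truncated to int32, which is where a value that
    is too large would silently turn negative or shrink.
    """
    m, n, k, lda, ldb, ldc = (to_int32(v) for v in (m, n, k, lda, ldb, ldc))
    if m < 0 or n < 0 or k < 0:
        return False
    rows_a = k if transa else m  # rows of A as stored (column-major)
    rows_b = n if transb else k  # rows of B as stored (column-major)
    return (lda >= max(1, rows_a)
            and ldb >= max(1, rows_b)
            and ldc >= max(1, m))
```

For example, `sgemm_args_valid(False, False, 256, 256, 4096, 256, 4096, 256)` passes, while any size or leading dimension at or above 2^31 wraps and fails the check. Logging the actual values passed in ggml_cuda_op_mul_mat_cublas just before the call would confirm or rule out this explanation.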
Possible Related Issues
- Large matrix dimensions causing integer overflow in cuBLAS parameters
- Context size exceeding certain CUDA/cuBLAS limits
- Memory alignment issues with very large tensors
Logs
/opt/llama.cpp-beta/build/bin/llama-server
-m /mnt/disco2/models/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf
-mm /mnt/disco2/models/Qwen3.6-35B-A3B-GGUF/mmproj-Qwen3.6-35B-A3B-BF16.gguf
--image-min-tokens 1024
-mg 1
-ngl 999
-sm layer
-ts 10,12
-t 8
-tb 6
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0
--presence-penalty 0.0
--repeat-penalty 1.0
--reasoning on
--spec-type draft-mtp
--spec-draft-n-max 3
--jinja
-b 256
-ub 256
-np 3
--cache-idle-slots
-c 131072
-fa on
-kvu
--cache-type-k q8_0
--cache-type-v q8_0
--host 0.0.0.0
--port 11434