Eval bug: CUDA error: unsupported value or parameter in cublasSgemm_v2 during large context processing #25061

@rbenrax

Name and Version

version: 9775 (be4a6a6)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

2 x rtx3060

Models

Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf

Problem description & steps to reproduce

Bug Description

The llama.cpp server (llama-server) crashes with a CUDA error while processing a prompt of about 105,000 tokens in a 131,072-token context. The error is raised by cublasSgemm_v2 during a matrix multiplication (ggml_cuda_op_mul_mat_cublas in the stack trace).

Error Message

First Bad Commit

Stack Trace

#0 __GI___wait4
#1 ggml_print_backtrace
#2 ggml_abort
#3 ggml_cuda_error
#4 ggml_cuda_op_mul_mat_cublas
#5 ggml_cuda_op_mul_mat
#6 ggml_backend_cuda_graph_compute
#7 ggml_backend_sched_graph_compute_async
#8 llama_context::graph_compute
#9 llama_context::process_ubatch
#10 llama_context::decode
#11 llama_decode
#12 server_context_impl::decode
#13 server_context_impl::update_slots
#14 server_queue::start_loop
#15 llama_server
#16 __libc_start_call_main
#17 __libc_start_main_impl
#18 _start

Relevant log output

Context Information from Logs

  • Slot ID: 1
  • Task ID: 19910
  • Context slot size (n_ctx_slot): 131,072 tokens
  • Prompt tokens (task.n_tokens): 105,375 tokens
  • Cached tokens: 105,371 (after incremental caching)
  • Context checkpoint created: 32 of 32 checkpoints
  • Checkpoint size: 269.744 MiB
  • Position: 104,923 tokens
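
The figures above can be put in relation to each other with a little arithmetic. The sketch below is only a reading aid: it copies the numbers from the log summary and assumes (this is not stated in the logs) that all 32 checkpoints are about the size of the one reported. It shows that only 4 prompt tokens were left to process when the error occurred, so the failing decode was most likely a very small batch rather than a full 256-token ubatch.

```python
# Figures copied from the log summary in this report.
n_ctx_slot = 131_072       # context slot size in tokens
prompt_tokens = 105_375    # task.n_tokens
cached_tokens = 105_371    # tokens already in the cache
n_checkpoints = 32
checkpoint_mib = 269.744   # size of the reported checkpoint

# Tokens that still had to be decoded when the crash happened.
remaining_tokens = prompt_tokens - cached_tokens

# Fraction of the slot occupied by the prompt.
slot_fill = prompt_tokens / n_ctx_slot

# Assumption: every checkpoint is as large as the one reported.
checkpoints_total_mib = n_checkpoints * checkpoint_mib

print(f"tokens left to decode: {remaining_tokens}")
print(f"slot occupancy:        {slot_fill:.1%}")
print(f"checkpoint total:      {checkpoints_total_mib:.1f} MiB "
      f"({checkpoints_total_mib / 1024:.2f} GiB)")
```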

Environment

  • OS: Linux (Ubuntu/Debian, inferred from paths)
  • llama.cpp path: /opt/llama.cpp-beta/
  • CUDA device: 0
  • Build type: Beta branch (llama.cpp-beta)

Steps to Reproduce

  1. Start llama.cpp server with a model supporting large context (n_ctx = 131072)
  2. Send a prompt with approximately 105,375 tokens
  3. The server processes the prompt, creates context checkpoints, caches tokens
  4. During the decode phase, the CUDA error occurs in cublasSgemm_v2
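
The steps above could be scripted roughly as follows. This is a minimal sketch, not the client that triggered the crash: the report does not say which endpoint or client was used, so the use of the server's /completion endpoint, the local URL, and the filler prompt from build_prompt (a hypothetical helper) are all assumptions. The number of tokens produced per filler word depends on the tokenizer, so the word count has to be tuned to reach roughly 105,000 tokens.

```python
import json
import urllib.request

# Assumption: the server runs locally on the port given in the command line
# of this report, and the /completion endpoint is used.
SERVER_URL = "http://127.0.0.1:11434/completion"


def build_prompt(n_words: int) -> str:
    """Hypothetical filler prompt; tokens per word depend on the tokenizer."""
    return " ".join(f"word{i}" for i in range(n_words))


def build_request(prompt: str, n_predict: int = 16) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the long prompt."""
    payload = {"prompt": prompt, "n_predict": n_predict}
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 3600.0) -> dict:
    """Send the request and decode the JSON reply; long prompts need a long timeout."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A run would then look like send(build_request(build_prompt(n_words))), with n_words increased until the server log reports a prompt size close to the one in this report.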

Additional Notes

  • The error happens after successfully caching most of the prompt (105,371 out of 105,375 tokens)
  • Context checkpoints were being created successfully (32 checkpoints)
  • The error specifically mentions "an unsupported value or parameter was passed to the function" in cuBLAS
  • This suggests a potential integer overflow or dimension mismatch when passing matrix dimensions to cublasSgemm_v2

Possible Related Issues

  • Large matrix dimensions causing integer overflow in cuBLAS parameters
  • Context size exceeding certain CUDA/cuBLAS limits
  • Memory alignment issues with very large tensors
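
To make the integer overflow hypothesis concrete, the sketch below models in plain Python the argument checks that, as far as I understand the cuBLAS documentation, lead gemm to return CUBLAS_STATUS_INVALID_VALUE (negative m, n or k, or a leading dimension smaller than the matrix requires). It also models what typically happens when a 64-bit size is narrowed to the 32-bit int that cublasSgemm_v2 takes. The function names and the example dimensions are hypothetical and are not taken from the crash. Pointer and enum checks are left out.

```python
def wrap_int32(value: int) -> int:
    """Value a 32-bit C int typically holds after a wider value is narrowed."""
    return (value + 2**31) % 2**32 - 2**31


def sgemm_arg_problems(m, n, k, lda, ldb, ldc, transa="N", transb="N"):
    """Return a list of reasons why these gemm arguments would be rejected.

    Models only the dimension checks; "N" means no transpose,
    anything else is treated as a transposed operand.
    """
    problems = []
    for name, value in (("m", m), ("n", n), ("k", k)):
        if value < 0:
            problems.append(f"{name} is negative ({value})")

    # Column-major storage: the leading dimension must cover the row count.
    min_lda = max(1, m if transa == "N" else k)
    min_ldb = max(1, k if transb == "N" else n)
    min_ldc = max(1, m)
    if lda < min_lda:
        problems.append(f"lda={lda} is smaller than {min_lda}")
    if ldb < min_ldb:
        problems.append(f"ldb={ldb} is smaller than {min_ldb}")
    if ldc < min_ldc:
        problems.append(f"ldc={ldc} is smaller than {min_ldc}")
    return problems


# Hypothetical, well-formed call: no problems reported.
print(sgemm_arg_problems(4096, 4, 4096, 4096, 4096, 4096))

# Hypothetical size above INT32_MAX: it wraps to a negative m and is rejected.
bad_m = wrap_int32(3_000_000_000)
print(bad_m, sgemm_arg_problems(bad_m, 4, 4096, 4096, 4096, 4096))
```

If a check like this were applied to the actual values passed in ggml_cuda_op_mul_mat_cublas at the time of the crash, it would show whether the failure is a wrapped dimension or a leading-dimension mismatch.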

Logs

/opt/llama.cpp-beta/build/bin/llama-server
-m /mnt/disco2/models/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf
-mm /mnt/disco2/models/Qwen3.6-35B-A3B-GGUF/mmproj-Qwen3.6-35B-A3B-BF16.gguf
--image-min-tokens 1024
-mg 1
-ngl 999
-sm layer
-ts 10,12
-t 8
-tb 6
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0
--presence-penalty 0.0
--repeat-penalty 1.0
--reasoning on
--spec-type draft-mtp
--spec-draft-n-max 3
--jinja
-b 256
-ub 256
-np 3
--cache-idle-slots
-c 131072
-fa on
-kvu
--cache-type-k q8_0
--cache-type-v q8_0
--host 0.0.0.0
--port 11434
