Eval bug: Vision CUDA crash (0.3.1 regression): Qwen3.6 35B A3B — missing context checkpoint invalidation on image token position jump #54

@chromascope-x

Name and Version

BeeLlama 0.3.1 (broken) vs 0.2.0 (working)

Operating systems

Windows

GGML backends

CUDA

Hardware

Ryzen 7950X3D + 3080ti

Models

Qwen3.6 35B A3B with mmproj

Problem description & steps to reproduce

Qwen3.6 35B A3B with mmproj crashes on image input in BeeLlama 0.3.1. The same model and mmproj work correctly in 0.2.0. Gemma 4 12B vision works fine in the same 0.3.1 build, which suggests the issue is specific to Qwen3.6's image token layout. Vanilla llama.cpp B9512 also handles the same input without crashing.

Root cause (observed):
During image processing, vision tokens produce a non-consecutive position jump (position 39 → 127). Upstream llama.cpp B9512 handles this gracefully by invalidating the affected checkpoint:

erased invalidated context checkpoint (pos_min = 127, pos_max = 127, n_tokens = 4123...)

BeeLlama 0.3.1 does not perform this invalidation and proceeds with the invalid position, triggering:

CUDA error: an illegal memory access was encountered
ggml_backend_cuda_synchronize at ggml-cuda.cu:3291
cudaStreamSynchronize(cuda_ctx->stream())
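
To make the expected handling concrete, here is a minimal sketch of the invalidation step, written in Python for brevity rather than the project's C++. The `Checkpoint` class, the `invalidate_checkpoints` helper, and the rule "erase any checkpoint whose `pos_min` lies beyond the last consecutive position" are assumptions for illustration only. They are not taken from the llama.cpp or BeeLlama source, and the real condition may differ.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Hypothetical stand-in for a saved context checkpoint."""
    pos_min: int
    pos_max: int
    n_tokens: int


def invalidate_checkpoints(checkpoints, last_pos, next_pos):
    """Split checkpoints into (kept, erased) around a position jump.

    Assumed rule: if next_pos does not directly follow last_pos, any
    checkpoint that starts beyond last_pos can no longer be trusted.
    """
    if next_pos == last_pos + 1:
        # Consecutive positions: nothing to invalidate.
        return list(checkpoints), []
    kept = [c for c in checkpoints if c.pos_min <= last_pos]
    erased = [c for c in checkpoints if c.pos_min > last_pos]
    return kept, erased


# Jump 39 -> 127 and the checkpoint at pos_min = pos_max = 127 with
# n_tokens = 4123 come from the logs in this report. The first
# checkpoint is invented to show one that survives the jump.
checkpoints = [Checkpoint(0, 39, 40), Checkpoint(127, 127, 4123)]
kept, erased = invalidate_checkpoints(checkpoints, last_pos=39, next_pos=127)

for c in erased:
    print(
        "erased invalidated context checkpoint "
        f"(pos_min = {c.pos_min}, pos_max = {c.pos_max}, n_tokens = {c.n_tokens})"
    )
```

The point of the sketch is only that the jump is detected before the batch is decoded, so the stale checkpoint is dropped instead of being reused.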

First Bad Commit

No response

Relevant log output

BeeLlama 0.3.1 log (crash):

0.18.278.005 I srv  process_chun: image processed in 9966 ms
0.18.283.000 W find_slot: non-consecutive token position 127 after 39 with 4 new tokens
0.18.283.042 W find_slot: non-consecutive token position 127 after 39 with 4 new tokens
0.18.346.565 E CUDA error: an illegal memory access was encountered
ggml-cuda.cu:104: CUDA error
0.18.346.572 E current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:3291
cudaStreamSynchronize(cuda_ctx->stream())

llama.cpp B9512 (working):

1.09.689.228 W slot update_slots: erased invalidated context checkpoint
             (pos_min = 127, pos_max = 127, n_tokens = 4123, size = 62.813 MiB)
→ prefill continues normally at 852 t/s
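
For anyone comparing logs across builds, the position jump in the BeeLlama log above can be extracted mechanically. The snippet below is a small helper sketch. The regular expression is based only on the `find_slot` warning text quoted in this report, and the name `find_position_jumps` is made up for this example.

```python
import re

# Matches the find_slot warning format seen in the BeeLlama log above.
JUMP_RE = re.compile(
    r"non-consecutive token position (\d+) after (\d+) with (\d+) new tokens"
)


def find_position_jumps(log_text):
    """Return unique (prev_pos, new_pos, n_new_tokens) tuples, in log order."""
    jumps = []
    for m in JUMP_RE.finditer(log_text):
        new_pos, prev_pos, n_new = (int(g) for g in m.groups())
        jump = (prev_pos, new_pos, n_new)
        if jump not in jumps:  # the same warning is logged twice in the report
            jumps.append(jump)
    return jumps


log = """\
0.18.283.000 W find_slot: non-consecutive token position 127 after 39 with 4 new tokens
0.18.283.042 W find_slot: non-consecutive token position 127 after 39 with 4 new tokens
0.18.346.565 E CUDA error: an illegal memory access was encountered
"""

for prev_pos, new_pos, n_new in find_position_jumps(log):
    skipped = new_pos - prev_pos - 1
    print(f"jump {prev_pos} -> {new_pos}: {skipped} positions skipped, {n_new} new tokens")
```

Running the same helper over a log from a working build would show whether the warning appears there as well, or only the checkpoint erase message.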

Expected behavior: Non-consecutive image token positions should be handled gracefully (invalidate checkpoint and continue), as implemented in upstream B9512.

Notes:

  • Non-multimodal (text-only) requests work fine in 0.3.1
  • Gemma 4 12B vision works correctly in 0.3.1 (no crash)
  • Gemma 4 uses SWA (n_swa=1024) and produces no non-consecutive position warnings during image processing
  • Qwen3.6 35B A3B is the affected model
  • The issue is specific to Qwen3.6's image token layout (position jump 39 → 127)
  • Qwen3.6 35B A3B has no SWA; SWA models such as Gemma 4 may handle image token position discontinuities differently at the checkpoint level
  • The regression appears to be in context checkpoint invalidation logic for multimodal token positions
