Eval bug: Vision CUDA crash (0.3.1 regression): Qwen3.6 35B A3B — missing context checkpoint invalidation on image token position jump

### Name and Version

BeeLlama 0.3.1 (broken) vs 0.2.0 (working)

### Operating systems

Windows

### GGML backends

CUDA

### Hardware

Ryzen 7950X3D + RTX 3080 Ti

### Models

Qwen3.6 35B A3B with mmproj

### Problem description & steps to reproduce

Qwen3.6 35B A3B with mmproj crashes on image input in BeeLlama 0.3.1. The same model and mmproj work correctly in 0.2.0. Gemma 4 12B vision works in the same 0.3.1 build, which suggests the issue is specific to Qwen3.6's image token layout. Vanilla llama.cpp B9512 also handles the same input without crashing.

Root cause (observed):
During image processing, vision tokens produce a non-consecutive position jump (position 39 → 127). Upstream llama.cpp B9512 handles this gracefully by invalidating the affected checkpoint:
```
erased invalidated context checkpoint (pos_min = 127, pos_max = 127, n_tokens = 4123...)
```

BeeLlama 0.3.1 does not perform this invalidation and proceeds with the invalid position, triggering:

```
CUDA error: an illegal memory access was encountered
ggml_backend_cuda_synchronize at ggml-cuda.cu:3291
cudaStreamSynchronize(cuda_ctx->stream())
```
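To make the expected handling concrete, here is a minimal sketch of the kind of invalidation step the upstream log line suggests. It is not the upstream or BeeLlama code: `Checkpoint`, `erase_invalid_checkpoints`, and the rule "erase any checkpoint whose `pos_max` lies past the last valid position" are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a server-side context checkpoint.
// Only the position range from the log line is modelled here.
struct Checkpoint {
    int32_t pos_min;
    int32_t pos_max;
};

// Assumed rule (not taken from upstream): a checkpoint is stale if it
// reaches past the last position that is still known to be valid.
// Erases all stale checkpoints and returns how many were erased.
inline std::size_t erase_invalid_checkpoints(std::vector<Checkpoint> & cps,
                                             int32_t last_valid_pos) {
    const std::size_t before = cps.size();
    cps.erase(std::remove_if(cps.begin(), cps.end(),
                             [last_valid_pos](const Checkpoint & c) {
                                 return c.pos_max > last_valid_pos;
                             }),
              cps.end());
    return before - cps.size();
}
```

With checkpoints covering `[0, 39]` and `[127, 127]` and a last valid position of 39, this sketch erases only the second one, which matches the shape of the upstream message (`pos_min = 127, pos_max = 127`). The real condition in B9512 may differ.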

### First Bad Commit

_No response_

### Relevant log output

BeeLlama 0.3.1 log (crash):
```
0.18.278.005 I srv  process_chun: image processed in 9966 ms
0.18.283.000 W find_slot: non-consecutive token position 127 after 39 with 4 new tokens
0.18.283.042 W find_slot: non-consecutive token position 127 after 39 with 4 new tokens
0.18.346.565 E CUDA error: an illegal memory access was encountered
ggml-cuda.cu:104: CUDA error
0.18.346.572 E current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:3291
cudaStreamSynchronize(cuda_ctx->stream())
```
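For triage on longer logs, a small helper like the following could extract the position jump from the `find_slot` warning. It is a sketch that assumes the warning text has exactly the format shown in the log above; `PositionJump` and `parse_position_jump` are hypothetical names, not part of BeeLlama or llama.cpp.

```cpp
#include <regex>
#include <string>

// Hypothetical record for one "non-consecutive token position" warning.
struct PositionJump {
    int next_pos;      // position the batch starts at (127 in the log above)
    int prev_pos;      // last position already present (39 in the log above)
    int n_new_tokens;  // number of new tokens reported by the warning
};

// Returns true and fills `out` if `line` contains the warning,
// assuming the wording matches the log excerpt in this report.
inline bool parse_position_jump(const std::string & line, PositionJump & out) {
    static const std::regex re(
        R"(non-consecutive token position (\d+) after (\d+) with (\d+) new tokens)");
    std::smatch m;
    if (!std::regex_search(line, m, re)) {
        return false;
    }
    out.next_pos     = std::stoi(m[1].str());
    out.prev_pos     = std::stoi(m[2].str());
    out.n_new_tokens = std::stoi(m[3].str());
    return true;
}
```

Running it over a full server log would show whether every crash is preceded by the same 39 → 127 jump or whether other jumps occur without crashing.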

llama.cpp B9512 (working):
```
1.09.689.228 W slot update_slots: erased invalidated context checkpoint
             (pos_min = 127, pos_max = 127, n_tokens = 4123, size = 62.813 MiB)
→ prefill continues normally at 852 t/s
```

**Expected behavior:** Non-consecutive image token positions should be handled gracefully (invalidate checkpoint and continue), as implemented in upstream B9512.

**Notes:**
- Text-only (non-multimodal) requests work fine in 0.3.1.
- Gemma 4 12B vision works correctly in 0.3.1 (no crash). It uses SWA (n_swa=1024) and produces no non-consecutive position warnings during image processing.
- Qwen3.6 35B A3B is the affected model; the crash follows the position jump 39 → 127 in its image token layout.
- Qwen3.6 35B A3B has no SWA. SWA models may handle image token position discontinuities differently at the checkpoint level, which could explain why Gemma 4 is unaffected.
- The regression appears to be in the context checkpoint invalidation logic for multimodal token positions.
