# DFlash speculative decoding crashes with AssertionError in unify_kv_cache_spec_page_size on hybrid model Qwen3.5-35B-A3B (H20)

### Current environment

- vLLM version: 0.20.1
- GPU: NVIDIA H20 x 1
- Model: Qwen3.5-35B-A3B (hybrid architecture: full-attention + GDN/linear-attention layers)
- Draft model: z-lab/Qwen3.5-35B-A3B-DFlash (speculative method: `dflash`)
- Attention backend: flash_attn (FlashAttention v3)
- MoE backend: triton


**Summary**

Starting `vllm serve` for the hybrid model `Qwen3.5-35B-A3B` **with DFlash speculative decoding** crashes during engine initialization with an `AssertionError` in `unify_kv_cache_spec_page_size`. The same command works for the reporter of #42505 on an RTX PRO 6000, so the failure appears to be hardware- or page-size-dependent.

**Command**

```bash
vllm serve /parent-dir/Qwen3.5-35B-A3B \
  --speculative-config '{"method": "dflash", "model": "/parent-dir/Qwen3.5-35B-A3B-DFlash/", "num_speculative_tokens": 8}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 16 \
  --moe-backend triton \
  --safetensors-load-strategy=prefetch \
  --max-model-len 183872
```

**What happens**

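The main model and the drafter load successfully. The crash occurs afterwards, during CUDA graph memory profiling: `_init_minimal_kv_cache_for_profiling` calls `get_kv_cache_groups`, and page-size unification fails on `assert new_spec.page_size_bytes == max_page_size`: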

```
INFO  [interface.py:606] Setting attention block size to 1088 tokens to ensure that attention page size is >= mamba page size.
INFO  [interface.py:630] Padding mamba page size by 0.74% to ensure that mamba page size and attention page size are exactly equal.
...
ERROR [core.py:1136] EngineCore failed to start.
ERROR [core.py:1136] AssertionError
```
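For readers unfamiliar with the two INFO lines, the arithmetic behind them can be sketched as follows. This is only an illustration: the per-token attention size, the mamba page size, and the block alignment are hypothetical values picked so that the result matches the log (1088 tokens, 0.74% padding); they are not read from the model config or from vLLM.

```python
import math

# Hypothetical sizes chosen only to reproduce the numbers in the log above;
# they are not taken from the model config or from vLLM.
ATTN_BYTES_PER_TOKEN = 2048      # assumed K+V bytes per token for one attention layer
MAMBA_PAGE_BYTES = 1080 * 2048   # assumed state size of one GDN/mamba layer
BLOCK_ALIGNMENT = 32             # assumed block-size granularity, in tokens


def align_attention_block(mamba_page: int, attn_per_token: int, alignment: int) -> int:
    """Smallest aligned block size whose attention page covers the mamba page."""
    tokens_needed = math.ceil(mamba_page / attn_per_token)
    return alignment * math.ceil(tokens_needed / alignment)


block_size = align_attention_block(
    MAMBA_PAGE_BYTES, ATTN_BYTES_PER_TOKEN, BLOCK_ALIGNMENT
)
attn_page_bytes = block_size * ATTN_BYTES_PER_TOKEN

# The mamba page is then padded up to the attention page so both are equal.
padding_pct = 100 * (attn_page_bytes - MAMBA_PAGE_BYTES) / MAMBA_PAGE_BYTES

print(f"attention block size: {block_size} tokens")
print(f"mamba page padding: {padding_pct:.2f}%")
```

With these assumed inputs the sketch yields a 1088-token block and 0.74% padding, which is consistent with the log but does not confirm the real byte sizes on H20.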

The drafter uses auxiliary attention layers from the speculative config:

```
INFO [gpu_model_runner.py:4839] Using auxiliary layers from speculative config: (1, 6, 11, 16, 22, 27, 32, 37)
```

<details>
<summary>Full traceback</summary>

```
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] EngineCore failed to start.
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] Traceback (most recent call last):
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     super().__init__(
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 385, in determine_available_memory
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5951, in profile_cudagraph_memory
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     self._init_minimal_kv_cache_for_profiling()
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5870, in _init_minimal_kv_cache_for_profiling
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 1654, in get_kv_cache_groups
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     kv_cache_spec = unify_kv_cache_spec_page_size(kv_cache_spec)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 1042, in unify_kv_cache_spec_page_size
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     assert new_spec.page_size_bytes == max_page_size
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] AssertionError
```

</details>
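
The failing assertion checks that, after a layer's block size has been scaled up, its page size equals the largest page size among all layers. The toy model below is not vLLM's code. It uses a hypothetical `ToySpec` class and made-up sizes to show one way such a check can fail: when a spec's page size does not grow linearly with its block size (here, because of a fixed padded size). Whether this is what happens with the DFlash drafter layers on H20 is unconfirmed.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical stand-in for a per-layer KV cache spec (not a vLLM class)."""
    block_size: int                     # tokens per page
    bytes_per_token: int
    padded_bytes: Optional[int] = None  # fixed page size, if padding was applied

    @property
    def page_size_bytes(self) -> int:
        # A padded spec reports its fixed size regardless of block_size.
        if self.padded_bytes is not None:
            return self.padded_bytes
        return self.block_size * self.bytes_per_token


def unify(specs: dict[str, ToySpec]) -> dict[str, ToySpec]:
    """Scale each layer's block size so every page matches the largest one."""
    max_page = max(s.page_size_bytes for s in specs.values())
    unified = {}
    for name, spec in specs.items():
        if spec.page_size_bytes != max_page:
            if max_page % spec.page_size_bytes != 0:
                raise ValueError(f"{name}: page size does not divide {max_page}")
            ratio = max_page // spec.page_size_bytes
            spec = replace(spec, block_size=spec.block_size * ratio)
            # Same invariant as in the traceback: scaling must reach max_page.
            assert spec.page_size_bytes == max_page, name
        unified[name] = spec
    return unified


# Case 1: page size is linear in block size, so unification succeeds.
ok = unify({
    "target_attn": ToySpec(block_size=1088, bytes_per_token=2048),
    "draft_attn": ToySpec(block_size=544, bytes_per_token=2048),
})
print("unified draft block size:", ok["draft_attn"].block_size)

# Case 2: a fixed padded page size ignores the new block size, so the
# assertion fires even though the divisibility check passed.
failed_layer = None
try:
    unify({
        "target_attn": ToySpec(block_size=1088, bytes_per_token=2048),
        "draft_padded": ToySpec(block_size=16, bytes_per_token=2048,
                                padded_bytes=544 * 2048),
    })
except AssertionError as exc:
    failed_layer = exc.args[0]
print("assertion failed for layer:", failed_layer)
```

Logging the `page_size_bytes` and `block_size` of every layer spec just before the assertion would show which layer breaks the invariant in the real run.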
