DFlash speculative decoding crashes with AssertionError in unify_kv_cache_spec_page_size on hybrid model Qwen3.5-35B-A3B (H20) #139

Description

@zhanyin-kun

Current environment

  • vLLM version: 0.20.1
  • GPU: NVIDIA H20 x 1
  • Model: Qwen3.5-35B-A3B (hybrid architecture: full-attention + GDN/linear-attention layers)
  • Draft model: z-lab/Qwen3.5-35B-A3B-DFlash (speculative method: dflash)
  • Attention backend: flash_attn (FlashAttention v3)
  • MoE backend: triton

Summary

Starting vllm serve for the hybrid model Qwen3.5-35B-A3B with DFlash speculative decoding crashes during engine initialization with an AssertionError in unify_kv_cache_spec_page_size. The same command works for the reporter of #42505 on an RTX PRO 6000, so the failure appears to depend on the hardware or on the page sizes chosen for it.

Command:

vllm serve /parent-dir/Qwen3.5-35B-A3B \
  --speculative-config '{"method": "dflash", "model": "/parent-dir/Qwen3.5-35B-A3B-DFlash/", "num_speculative_tokens": 8}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 16 \
  --moe-backend triton \
  --safetensors-load-strategy=prefetch \
  --max-model-len 183872

What happens

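The main model and the drafter load successfully. The crash occurs later, while the worker builds a minimal KV cache for CUDA graph memory profiling (profile_cudagraph_memory -> _init_minimal_kv_cache_for_profiling -> get_kv_cache_groups), where page-size unification fails: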

INFO  [interface.py:606] Setting attention block size to 1088 tokens to ensure that attention page size is >= mamba page size.
INFO  [interface.py:630] Padding mamba page size by 0.74% to ensure that mamba page size and attention page size are exactly equal.
...
ERROR [core.py:1136] EngineCore failed to start.
ERROR [core.py:1136] AssertionError
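
For readers unfamiliar with this check, here is a minimal sketch of how page-size unification of this kind can trip an assertion. It is not vLLM's implementation: ToySpec, bytes_per_token, page_size_padded and unify_page_size are hypothetical names, and all numbers are illustrative rather than taken from this run. The sketch assumes that a layer's page size is normally block_size * bytes_per_token, and that a padded page size, when set, overrides that product. Under that assumption, scaling block_size no longer changes the page size, so the post-scaling equality check fails. Whether this is what happens here is unconfirmed.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical stand-in for a per-layer KV cache spec (not vLLM's class)."""
    block_size: int                          # tokens per page
    bytes_per_token: int                     # KV bytes stored per token
    page_size_padded: Optional[int] = None   # assumed override, if set

    @property
    def page_size_bytes(self) -> int:
        # Assumption: a padded override wins over the computed size,
        # which makes the page size independent of block_size.
        if self.page_size_padded is not None:
            return self.page_size_padded
        return self.block_size * self.bytes_per_token


def unify_page_size(specs: dict[str, ToySpec]) -> dict[str, ToySpec]:
    """Scale each layer's block_size so every page matches the largest one."""
    max_page = max(s.page_size_bytes for s in specs.values())
    unified = {}
    for name, spec in specs.items():
        if spec.page_size_bytes != max_page:
            if max_page % spec.page_size_bytes != 0:
                raise NotImplementedError(
                    f"{name}: page size does not divide {max_page}"
                )
            ratio = max_page // spec.page_size_bytes
            spec = replace(spec, block_size=spec.block_size * ratio)
            # Same shape of check as the assertion in the traceback.
            assert spec.page_size_bytes == max_page, name
        unified[name] = spec
    return unified


if __name__ == "__main__":
    # Linear case: the smaller layer is scaled by 4x and the check passes.
    ok = unify_page_size({
        "main": ToySpec(block_size=16, bytes_per_token=4096),
        "draft": ToySpec(block_size=16, bytes_per_token=1024),
    })
    print(ok["draft"].block_size, ok["draft"].page_size_bytes)

    # Padded case: the override ignores the new block_size, so the check fails.
    try:
        unify_page_size({
            "main": ToySpec(block_size=16, bytes_per_token=4096),
            "draft": ToySpec(16, 1000, page_size_padded=16384),
        })
    except AssertionError as exc:
        print("AssertionError for layer:", exc)
```

If the real specs behave like the padded case, comparing each layer's block size and page size just before unification (main model versus drafter layers) should show which spec fails to scale.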

The drafter uses auxiliary attention layers from the speculative config:

INFO [gpu_model_runner.py:4839] Using auxiliary layers from speculative config: (1, 6, 11, 16, 22, 27, 32, 37)

Full traceback

(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] EngineCore failed to start.
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] Traceback (most recent call last):
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     super().__init__(
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 385, in determine_available_memory
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5951, in profile_cudagraph_memory
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     self._init_minimal_kv_cache_for_profiling()
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5870, in _init_minimal_kv_cache_for_profiling
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 1654, in get_kv_cache_groups
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     kv_cache_spec = unify_kv_cache_spec_page_size(kv_cache_spec)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 1042, in unify_kv_cache_spec_page_size
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136]     assert new_spec.page_size_bytes == max_page_size
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] AssertionError
