(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] EngineCore failed to start.
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] Traceback (most recent call last):
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] return func(*args, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] super().__init__(
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] return func(*args, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] return self.collective_rpc("determine_available_memory")
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] return func(*args, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] return func(*args, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 385, in determine_available_memory
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] return func(*args, **kwargs)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5951, in profile_cudagraph_memory
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] self._init_minimal_kv_cache_for_profiling()
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5870, in _init_minimal_kv_cache_for_profiling
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 1654, in get_kv_cache_groups
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] kv_cache_spec = unify_kv_cache_spec_page_size(kv_cache_spec)
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] File "/usr/local/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 1042, in unify_kv_cache_spec_page_size
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] assert new_spec.page_size_bytes == max_page_size
(EngineCore pid=2518) ERROR 06-23 18:03:05 [core.py:1136] AssertionError
DFlash speculative decoding crashes with AssertionError in unify_kv_cache_spec_page_size on hybrid model Qwen3.5-35B-A3B (H20)
current environment
Summary
Starting vllm serve for the hybrid model Qwen3.5-35B-A3B with DFlash speculative decoding crashes during engine init with an AssertionError in unify_kv_cache_spec_page_size. The exact same command works fine for the reporter of #42505 on an RTX PRO 6000, so this appears to be hardware / page-size dependent.

Script:

vllm serve /parent-dir/Qwen3.5-35B-A3B \
  --speculative-config '{"method": "dflash", "model": "/parent-dir/Qwen3.5-35B-A3B-DFlash/", "num_speculative_tokens": 8}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 16 \
  --moe-backend triton \
  --safetensors-load-strategy=prefetch \
  --max-model-len 183872

What happens
The main model and the drafter load successfully. During KV cache profiling, page-size unification fails:
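For readers unfamiliar with this step, here is a toy model of how a check shaped like `assert new_spec.page_size_bytes == max_page_size` can fail. It is only a sketch under stated assumptions: `ToySpec`, `fixed_page_bytes` and `unify_page_size` are hypothetical names, not vLLM code, and the actual cause on H20 has not been confirmed. The assumption illustrated is that unification scales `block_size` of the smaller-page layers, which only works if a layer's page size really grows with its block size.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical stand-in for a per-layer KV cache spec (not a vLLM class)."""
    block_size: int                          # tokens per page
    bytes_per_token: int                     # KV bytes stored per token
    fixed_page_bytes: Optional[int] = None   # assumption: page size that ignores block_size

    @property
    def page_size_bytes(self) -> int:
        if self.fixed_page_bytes is not None:
            return self.fixed_page_bytes
        return self.block_size * self.bytes_per_token


def unify_page_size(specs: dict[str, ToySpec]) -> dict[str, ToySpec]:
    """Grow block_size of smaller-page layers so every page matches the largest one."""
    max_page = max(s.page_size_bytes for s in specs.values())
    unified = {}
    for name, spec in specs.items():
        if spec.page_size_bytes != max_page:
            if max_page % spec.page_size_bytes != 0:
                raise NotImplementedError(f"{name}: page size does not divide {max_page}")
            ratio = max_page // spec.page_size_bytes
            spec = replace(spec, block_size=spec.block_size * ratio)
            # Same shape of check as the assertion in the traceback.
            assert spec.page_size_bytes == max_page, f"{name}: still {spec.page_size_bytes}"
        unified[name] = spec
    return unified


# Case 1: page size is linear in block_size, so scaling block_size works.
ok = unify_page_size({
    "full_attn": ToySpec(block_size=16, bytes_per_token=4096),
    "draft_attn": ToySpec(block_size=16, bytes_per_token=1024),
})
print(ok["draft_attn"].block_size)  # 64

# Case 2: one layer's page size does not depend on block_size, so the check fails.
try:
    unify_page_size({
        "full_attn": ToySpec(block_size=16, bytes_per_token=4096),
        "padded_layer": ToySpec(block_size=16, bytes_per_token=1024, fixed_page_bytes=32768),
    })
except AssertionError as exc:
    print("unification failed:", exc)
```

If something similar is happening here, comparing the per-layer page sizes and block sizes of the main model and the drafter on both GPUs would show which layer does not scale.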
The drafter uses auxiliary attention layers from the speculative config:
Full traceback