DFlash does not seem to perform well on the DGX Spark.
I tested gemma4-26B and qwen3.6-35B, and in both cases the average throughput was 10% lower than with the baseline model.
Does anyone know how to optimize this?
Docker image: vllm/vllm-openai:v0.21.0-aarch64-ubuntu2404
vLLM commands:
vllm serve google/gemma-4-26B-A4B-it \
--served-model-name gemma-4-26B-A4B-it \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--max-model-len 262144 \
--max-num-batched-tokens 65536 \
--gpu-memory-utilization 0.78 \
--host 0.0.0.0 \
--port 8000 \
--attention-backend triton_attn \
--trust-remote-code \
--speculative-config '{"method": "dflash", "model": "z-lab/gemma-4-26B-A4B-it-DFlash", "num_speculative_tokens": 8}'

vllm serve Qwen/Qwen3.6-35B-A3B \
--served-model-name Qwen3.6-35B-A3B-Dflash \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--max-model-len 262144 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.78 \
--host 0.0.0.0 \
--port 8000 \
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'
Note:
"attention_backend": "flash_attn" is not supported in --speculative-config, so I had to remove it to get vLLM to work.
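
For anyone who wants to reproduce the comparison, here is a rough sketch of how the throughput difference could be measured against the OpenAI-compatible endpoint started by the commands above. It is not the method used for the 10% figure; the helper names (relative_change, measure_tps), the prompt, and the single-request setup are assumptions made for illustration. The timing includes prefill, so it only approximates decode throughput.

```python
import json
import time
import urllib.request


def relative_change(baseline_tps: float, candidate_tps: float) -> float:
    """Fractional change of candidate vs. baseline (-0.10 means 10% slower)."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (candidate_tps - baseline_tps) / baseline_tps


def measure_tps(base_url: str, model: str, prompt: str, max_tokens: int = 256) -> float:
    """Send one non-streaming completion request; return output tokens per second.

    Assumes the server reports usage.completion_tokens in the response body.
    """
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start  # wall time, prefill included
    return body["usage"]["completion_tokens"] / elapsed


if __name__ == "__main__":
    # Hypothetical run: measure once with --speculative-config and once without,
    # restarting the server in between, then compare the two numbers.
    tps = measure_tps(
        "http://localhost:8000",
        "gemma-4-26B-A4B-it",
        "Explain speculative decoding in one paragraph.",
    )
    print(f"{tps:.1f} output tokens/s")
```

Averaging several requests per configuration, and using the same prompts for both runs, would give a more stable number than a single request.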