DFlash does not seem to perform well on a DGX Spark box #129

DFlash does not seem to perform well on a DGX Spark box.

I tested gemma-4-26B-A4B-it and Qwen3.6-35B-A3B with their DFlash draft models, and in both cases the average throughput was 10% lower than with the baseline model.

Does anyone know how to optimize this?

Docker image: `vllm/vllm-openai:v0.21.0-aarch64-ubuntu2404`

vLLM commands:

```
vllm serve google/gemma-4-26B-A4B-it \
  --served-model-name gemma-4-26B-A4B-it \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --max-model-len 262144 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.78 \
  --host 0.0.0.0 \
  --port 8000 \
  --attention-backend triton_attn \
  --trust-remote-code \
  --speculative-config '{"method": "dflash", "model": "z-lab/gemma-4-26B-A4B-it-DFlash", "num_speculative_tokens": 8}'

vllm serve Qwen/Qwen3.6-35B-A3B \
  --served-model-name Qwen3.6-35B-A3B-Dflash \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --max-model-len 262144 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.78 \
  --host 0.0.0.0 \
  --port 8000 \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'
```
**Note:** `"attention_backend": "flash_attn"` is not supported inside `--speculative-config`, so I had to remove it to get vLLM to work.
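
For anyone who wants to reproduce the throughput comparison, here is a minimal sketch of how the numbers could be collected and compared. It is not the exact procedure used for the figures above. It assumes the server is reachable on `localhost:8000`, that `curl` and `jq` are installed, and that GNU `date` is available for sub-second timestamps. The prompt, `max_tokens` value, and function names are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: compare throughput between a baseline run and a DFlash run.
# Assumptions (not from the report): localhost:8000, curl, jq, GNU date.

# tokens_per_second TOKENS SECONDS -> prints throughput with two decimals
tokens_per_second() {
  awk -v t="$1" -v s="$2" 'BEGIN { printf "%.2f\n", t / s }'
}

# percent_change BASELINE CANDIDATE -> prints signed relative change in percent
percent_change() {
  awk -v b="$1" -v c="$2" 'BEGIN { printf "%.1f\n", (c - b) / b * 100 }'
}

# measure_once MODEL_NAME -> sends one completion request, prints tokens/s.
# This is end-to-end time (prefill + decode), so keep the prompt short
# if decode speed is what you want to compare.
measure_once() {
  local model="$1" start end tokens elapsed
  start=$(date +%s.%N)
  tokens=$(curl -s http://localhost:8000/v1/completions \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"${model}\", \"prompt\": \"Write a short story.\", \"max_tokens\": 512}" \
    | jq '.usage.completion_tokens')
  end=$(date +%s.%N)
  elapsed=$(awk -v a="$start" -v b="$end" 'BEGIN { print b - a }')
  tokens_per_second "$tokens" "$elapsed"
}
```

Usage would look like `measure_once gemma-4-26B-A4B-it`, run once against the server started without `--speculative-config` and once with it, followed by `percent_change <baseline_tps> <dflash_tps>`. A negative result means the DFlash run is slower.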
