DFlash models do not seem to perform well on the DGX Spark #129

Description

@carvinrui

DFlash does not seem to perform well on the DGX Spark.

I tested gemma4-26B and qwen3.6-35B, but with DFlash enabled the average throughput was about 10% lower than with the baseline model.
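For anyone trying to reproduce the comparison, here is a minimal sketch of how the relative throughput change can be computed. The token counts and timings are hypothetical placeholders, not my measurements, and the function names are made up for illustration.

```python
def tokens_per_second(generated_tokens: int, elapsed_seconds: float) -> float:
    """Average decode throughput over one benchmark run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return generated_tokens / elapsed_seconds


def relative_change_percent(baseline_tps: float, candidate_tps: float) -> float:
    """Percent change of the candidate relative to the baseline (negative = slower)."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (candidate_tps - baseline_tps) / baseline_tps * 100.0


# Hypothetical numbers, for illustration only.
baseline_tps = tokens_per_second(generated_tokens=12_000, elapsed_seconds=100.0)
dflash_tps = tokens_per_second(generated_tokens=10_800, elapsed_seconds=100.0)
change = relative_change_percent(baseline_tps, dflash_tps)

print(f"baseline: {baseline_tps:.1f} tok/s, dflash: {dflash_tps:.1f} tok/s, change: {change:+.1f}%")
```

Both runs should use the same prompts, output lengths, and concurrency, otherwise the percentage is not comparable.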

Does anyone know how to optimize this?

Docker image: vllm/vllm-openai:v0.21.0-aarch64-ubuntu2404

vLLM commands:

     vllm serve google/gemma-4-26B-A4B-it \
         --served-model-name gemma-4-26B-A4B-it \
         --enable-auto-tool-choice \
         --tool-call-parser gemma4 \
         --max-model-len 262144 \
         --max-num-batched-tokens 65536 \
         --gpu-memory-utilization 0.78 \
         --host 0.0.0.0 \
         --port 8000 \
         --attention-backend triton_attn \
         --trust-remote-code \
         --speculative-config '{"method": "dflash", "model": "z-lab/gemma-4-26B-A4B-it-DFlash", "num_speculative_tokens": 8}'

     vllm serve Qwen/Qwen3.6-35B-A3B \
         --served-model-name Qwen3.6-35B-A3B-Dflash \
         --enable-auto-tool-choice \
         --tool-call-parser qwen3_coder \
         --max-model-len 262144 \
         --max-num-batched-tokens 32768 \
         --gpu-memory-utilization 0.78 \
         --host 0.0.0.0 \
         --port 8000 \
         --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'

Note:
"attention_backend": "flash_attn" is not supported in --speculative-config with this image, so I had to remove it to get vLLM to start.
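To make the workaround explicit, here is a small sketch that drops the rejected key and prints the flag exactly as it is passed in the Qwen command above. The helper name and the `unsupported_keys` parameter are hypothetical; the starting dictionary is an assumption about what the config looked like before the key was removed.

```python
import json
import shlex


def build_speculative_config_arg(config: dict, unsupported_keys=("attention_backend",)) -> str:
    """Return a shell-quoted --speculative-config flag without the unsupported keys."""
    cleaned = {key: value for key, value in config.items() if key not in unsupported_keys}
    return "--speculative-config " + shlex.quote(json.dumps(cleaned))


# Assumed original config, including the key that had to be removed.
original_config = {
    "method": "dflash",
    "model": "z-lab/Qwen3.6-35B-A3B-DFlash",
    "num_speculative_tokens": 15,
    "attention_backend": "flash_attn",
}

flag = build_speculative_config_arg(original_config)
print(flag)
```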
