
Acceptance rate lower than benchmark with Qwen3-8B + Dflash #133

Description

@wsb853529465

My vLLM benchmark shows an acceptance length below 3, mainly because the acceptance rate at every draft position is lower than expected.
Could you please explain why? My parameter configuration, the commands I ran, and the results are below.

Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x7f97671293a0>, trust_remote_code=False, seed=0, num_prompts=16, dataset_name='custom', no_stream=False, dataset_path='/workspace/dataset/human-eval/data/HumanEval.jsonl', no_oversample=False, skip_chat_template=False, enable_multimodal_chat=False, disable_shuffle=False, custom_output_len=256, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, asr_max_audio_len_sec=inf, asr_min_audio_len_sec=0.0, random_input_len=1024, random_output_len=128, random_range_ratio='0.0', random_prefix_len=0, random_batch_size=1, no_reranker=False, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 1}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, speed_bench_dataset_subset='qualitative', speed_bench_output_len=4096, speed_bench_category=None, label=None, backend='openai', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', header=None, max_concurrency=1, model='/workspace/models/Qwen3-8B', input_len=None, output_len=None, tokenizer=None, tokenizer_mode='auto', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, disable_tqdm=False, num_warmups=2, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics=None, metric_percentiles='99', goodput=None, request_id_prefix='bench-f5d0aa47-', top_p=None, top_k=None, min_p=None, temperature=0.0, frequency_penalty=None, 
presence_penalty=None, repetition_penalty=None, served_model_name=None, lora_modules=None, lora_assignment='random', ramp_up_strategy=None, ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=0, extra_body=None, skip_tokenizer_init=False, insecure=False, plot_timeline=False, timeline_itl_thresholds='25,50', plot_dataset_stats=False)
Server command:
vllm serve /workspace/models/Qwen3-8B --attention-backend flash_attn --trust-remote-code --max-num-batched-tokens 32768 --speculative-config '{"model": "/workspace/models/dflash-zlab","num_speculative_tokens": 15,"method": "dflash"}' --tensor-parallel-size 2
Benchmark command:
vllm bench serve --backend openai --model /workspace/models/Qwen3-8B --num-prompts 16 --dataset-path /workspace/dataset/human-eval/data/HumanEval.jsonl --dataset-name custom --max-concurrency 1 --num-warmups 2 --temperature 0
Result:
============ Serving Benchmark Result ============
Successful requests: 128
Failed requests: 0
Maximum request concurrency: 1
Benchmark duration (s): 129.94
Total input tokens: 17954
Total generated tokens: 32768
Request throughput (req/s): 0.99
Output token throughput (tok/s): 252.18
Peak output token throughput (tok/s): 93.00
Peak concurrent requests: 3.00
Total token throughput (tok/s): 390.36
---------------Time to First Token----------------
Mean TTFT (ms): 27.45
Median TTFT (ms): 25.10
P99 TTFT (ms): 124.20
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 3.87
Median TPOT (ms): 3.89
P99 TPOT (ms): 4.65
---------------Inter-token Latency----------------
Mean ITL (ms): 10.84
Median ITL (ms): 10.83
P99 ITL (ms): 11.35
---------------Speculative Decoding---------------
Acceptance rate (%): 12.14
Acceptance length: 2.82
Drafts: 11662
Draft tokens: 174930
Accepted tokens: 21230
Per-position acceptance (%):
Position 0: 70.76
Position 1: 43.05
Position 2: 25.17
Position 3: 15.48
Position 4: 9.56
Position 5: 6.09
Position 6: 3.78
Position 7: 2.53
Position 8: 1.72
Position 9: 1.13
Position 10: 0.83
Position 11: 0.66
Position 12: 0.55
Position 13: 0.43
Position 14: 0.31
==================================================
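As a sanity check on the numbers above, the following minimal sketch recomputes the summary metrics from the raw counters. It assumes (this is an assumption, not something confirmed in this issue) that acceptance rate is accepted tokens divided by draft tokens, and that acceptance length is 1 plus the mean number of accepted tokens per draft, the extra 1 being the token the target model always emits. The variable names are illustrative only.

```python
# Values copied from the "Speculative Decoding" section of the result above.
num_speculative_tokens = 15
drafts = 11662
draft_tokens = 174930
accepted_tokens = 21230
per_position_pct = [
    70.76, 43.05, 25.17, 15.48, 9.56, 6.09, 3.78, 2.53,
    1.72, 1.13, 0.83, 0.66, 0.55, 0.43, 0.31,
]

# Each draft proposes num_speculative_tokens tokens.
assert drafts * num_speculative_tokens == draft_tokens

# Assumed definition: fraction of all proposed tokens that were accepted.
acceptance_rate_pct = 100 * accepted_tokens / draft_tokens

# Assumed definition: 1 guaranteed target token + mean accepted draft tokens.
acceptance_length = 1 + accepted_tokens / drafts

# Same quantity rebuilt from the per-position rates.
acceptance_length_from_positions = 1 + sum(per_position_pct) / 100

# Conditional rate at position i, given that position i-1 was accepted.
conditional_pct = [per_position_pct[0]] + [
    100 * cur / prev
    for prev, cur in zip(per_position_pct, per_position_pct[1:])
]

print(f"acceptance rate (%): {acceptance_rate_pct:.2f}")
print(f"acceptance length:   {acceptance_length:.2f}")
print(f"from positions:      {acceptance_length_from_positions:.2f}")
print("conditional (%):     ", [round(c, 1) for c in conditional_pct])
```

Under these assumptions the recomputed values agree with the reported 12.14% and 2.82, so the summary is internally consistent. The 12.14% figure is small largely because it is averaged over all 15 draft positions, most of which are rarely reached. The per-position and conditional rates are the more informative numbers when comparing against a published benchmark.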
