Skip to content

[feat] add support for D2SD-mode VP-Drafter training#2

Open
catnanami wants to merge 1 commit into
deepseek-ai:mainfrom
catnanami:add-dflash-d2-feature
Open

[feat] add support for D2SD-mode VP-Drafter training#2
catnanami wants to merge 1 commit into
deepseek-ai:mainfrom
catnanami:add-dflash-d2-feature

Conversation

@catnanami

@catnanami catnanami commented Jun 27, 2026

Copy link
Copy Markdown

Motivation

This PR adds training support for the VP-Drafter used in D2SD (Dual Diffusion Draft Speculative Decoding). D2SD extends DFlash by using a first DFlash draft to estimate likely rejection boundaries, then training a second variable-prefix drafter to re-anchor at selected prefixes and generate alternative continuations.

The key training requirement is different from standard DFlash: the drafter must learn from variable-length visible prefixes instead of always seeing only the anchor token followed by masks. This PR implements that behavior as a DFlash training-mode branch.

pipeline_contrast

References:

Modifications

  • Added D2-style feature support to dflash training.

  • The D2 mode samples a variable visible prefix length per sampled anchor block.

  • Prefix tokens are fed as real token embeddings; suffix positions remain masked and contribute to loss.

  • Added D2 prefix-length sampling controlled by d2_prefix_weight_base.

  • Added D2-specific loss masking so visible prefix positions are excluded from supervision.

  • Added loss decay offsets so exponential decay starts from the first masked suffix position.

  • Wired Qwen3 and Gemma4 DSpark/dflash models to read enable_d2_feature and d2_prefix_weight_base from draft config.

  • Wired Qwen3 and Gemma4 draft config builders to propagate D2 feature config from model_args.

  • Enabled D2 feature in all dflash configs:

    • config/dflash/dflash_qwen3_4b.py
    • config/dflash/dflash_qwen3_8b.py
    • config/dflash/dflash_qwen3_14b.py
    • config/dflash/dflash_gemma4_12b.py
  • Added shared helper utilities for D2 prefix sampling, D2 noise embedding construction, and D2 eval-mask construction.

  • Kept the original DSpark/dflash training path unchanged when enable_d2_feature=False.

@catnanami catnanami changed the title feat: add d2 feature for dflash training [feat] add support for D2SD-mode VP-Drafter training Jun 27, 2026
@findshan

Copy link
Copy Markdown

Thanks for the implementation. A few thoughts on the D²SD approach itself:
The core idea of confidence-guided re-anchoring is elegant — using r(i) to localize the rejection boundary is clearly more principled than naive resampling, and the ablation backs this up.
That said, I have a concern about what the ablations actually establish. Table 5 compares D²SD against naive resampling at a matched branch count (K=4), but not at matched compute: the naive branches are essentially a logits view of a single forward, whereas D²SD adds a full VP-Drafter pass. So the ablation shows the re-anchoring placement is better than random — but not that spending a second drafter's worth of compute on branching is itself the right call.
On throughput: the speedup numbers are measured under single-request / low-batch latency, where the extra draft cost is hidden by parallelism. In high-concurrency serving, two separate draft passes (DFlash + VP-Drafter) roughly double the draft-side compute and eat into batch capacity. DSpark notes that even MTP-3/5 degrades aggregate throughput under load from verification overhead alone — a second drafter compounds this. Has the throughput-vs-interactivity tradeoff been evaluated under batched serving?
On compute allocation: the underlying tension is whether the second-draft budget is better spent on branching (D²SD) or on improving single-sequence quality (e.g. DSpark's lightweight Markov head, which recovers much of the suffix decay without a second model). At matched compute, it's not obvious branching wins — and as far as I can tell there's no iso-FLOP comparison against a quality-focused single-sequence baseline in the paper.
Genuinely curious whether there's a regime where D²SD is strictly better — perhaps very low concurrency / latency-critical single-user settings

@catnanami

Copy link
Copy Markdown
Author

@zdaxie Thank you for your interest in our work and for raising these very valuable points. We find your comments highly insightful!

Regarding the first question, we did not strictly control the compute cost of the two experiments to be exactly the same within a single iteration. In other words, what we actually controlled was the same verification budget, rather than the same per-iteration compute cost. Since D²SD indeed introduces a higher draft cost due to the second draft pass, in our experimental comparison we report not only the single-step accepted length, but also the end-to-end speedup. Our goal is to examine whether the additional draft cost introduced by the second draft pass would offset the overall end-to-end speedup gained from the increase in TPF brought by D²SD.

Regarding the second question, we acknowledge that since the verification budget of D²SD is significantly higher than that of DFlash or MTP, its overall throughput will noticeably decrease under high-concurrency settings. The additional cost introduced by the second draft pass in real serving frameworks is also an issue we have been actively investigating. We plan to adapt D²SD to SGLang and vLLM in the near future, and we expect this method to be mainly effective in low-concurrency scenarios.

Regarding the third question, we allocate the second-draft budget to branching mainly to exploit the parallelism of the second draft pass. We previously tried improving single-sequence quality by resampling from the most likely rejection boundary. However, the additional draft cost introduced by the second draft pass would offset the end-to-end speedup gained from the increase in TPF. By contrast, starting parallel drafts from multiple possible rejection boundaries can significantly improve TPF without substantially increasing the draft cost, especially in low-concurrency settings.

As for further comparisons, after completing the adaptation to serving frameworks, we plan to conduct experiments under high-concurrency settings and compare D²SD end-to-end against most existing methods, in order to better identify the regime where D²SD has an advantage.

Finally, thank you again for your constructive suggestions on our work, and we also sincerely appreciate the significant contributions that DSpark has made to the speculative decoding community.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants