Feature Request: DSpark confidence-scheduled verification & semi-autoregressive drafting #25096

Description

@gangula-karthik

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

This proposal adds two complementary enhancements to llama.cpp's speculative decoding pipeline, inspired by the DSpark paper ("DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation", DeepSeek-AI):

  1. Confidence-Scheduled Verification. Today, the draft loop in common/speculative.cpp already performs per-token greedy early-stopping: it stops extending a draft as soon as a single token's top probability falls below --spec-draft-p-min (speculative.cpp:331). This proposal adds two things that don't currently exist:

    • Joint (cumulative) survival probability. Instead of comparing each token independently against a fixed threshold, accumulate a running product of the per-token draft probabilities and truncate the draft when the joint probability that the whole prefix survives falls below a threshold. For longer drafts this is stricter than the per-token check, since the product decays even when every individual token clears p_min.
    • Load-aware verification scheduling. Tailor the verification length per request using the engine's current throughput/concurrency profile, not just draft-side confidence, so that under high concurrency, batch capacity isn't spent verifying low-survival tail tokens. llama.cpp currently has no load-aware component here.
  2. Semi-Autoregressive (SAR) Draft Model Support (draft-sar). Add a new speculative type for SAR draft models, which pair a parallel backbone (full block proposal in one forward pass) with a lightweight sequential refinement module that introduces intra-block token dependencies.

Motivation

Confidence-Scheduled Verification: The existing p_min early-stop is a good per-token heuristic, but it makes each token's keep/drop decision in isolation. DSpark's insight is that what matters for wasted verification is the joint probability that an entire draft prefix survives, combined with how loaded the serving engine currently is. Under high concurrency, verifying tail tokens that are jointly unlikely to be accepted consumes target-model batch slots that could serve other requests. A cumulative-survival cutoff plus load-aware scheduling lets the system spend verification capacity where it pays off. DSpark reports this shifts the throughput-vs-latency Pareto frontier meaningfully in production serving. This change is inference-only and applies to all existing draft strategies (draft-simple, draft-eagle3, draft-mtp) with no new model weights.

SAR Draft Model Support: Parallel drafters (e.g. DFlash, MTP-style multi-token heads) generate a whole block of draft tokens in one pass and treat each position independently, so they lack inter-token dependencies. This causes suffix decay: acceptance rates fall sharply for later positions in the block because those tokens aren't conditioned on the earlier drafts in the same block. The SAR architecture adds a sequential refinement pass over the parallel backbone's output to model intra-block dependencies, directly attacking suffix decay. DSpark reports 60–85% per-user generation speedup over an MTP-1 baseline at matched throughput on live DeepSeek-V4 traffic. Supporting draft-sar gives llama.cpp users access to this where SAR draft heads are available or can be trained.

Possible Implementation

Feature 1 - Confidence-Scheduled Verification:

  • Add a parameter (e.g. --spec-draft-p-survival-min, float, default 0.0 = disabled) to common_params_speculative_draft (common/common.h:323), alongside the existing p_min/p_split.
  • In the draft loop in common/speculative.cpp (the section around speculative.cpp:304-351), maintain a running product of the drafted tokens' top probabilities (cur_p->data[0].p is already available) and stop extending the draft when the product drops below the threshold. This complements, rather than replaces, the existing per-token p_min check.
  • For the load-aware component, the verification-length cap can be modulated by current batch occupancy / slot pressure in the server scheduler. This is the larger and more invasive part and could land as a follow-up after the joint-survival cutoff.
  • This is inference-only and benefits all existing draft-model strategies.
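
A minimal standalone sketch of the two ideas above, not actual llama.cpp code. `draft_len_joint` and `verify_cap` are hypothetical helpers, and `p_survival_min` mirrors the proposed `--spec-draft-p-survival-min` parameter. The exact point at which the real draft loop applies the p_min check, and the linear occupancy policy in `verify_cap`, are assumptions made for illustration; DSpark's actual scheduling policy may differ.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: given the draft model's top probability at each draft
// position, return how many leading draft tokens to keep.
// p_survival_min == 0 disables the joint check (the proposed default).
size_t draft_len_joint(const std::vector<float> & p_top, float p_min, float p_survival_min) {
    double survival = 1.0; // running product: joint probability the prefix survives
    size_t n_keep   = 0;
    for (float p : p_top) {
        if (p < p_min) {
            break; // per-token early stop (assumed to drop the low-confidence token)
        }
        survival *= p;
        if (p_survival_min > 0.0f && survival < p_survival_min) {
            break; // joint cutoff: the prefix as a whole is unlikely to be accepted
        }
        ++n_keep;
    }
    return n_keep;
}

// Hypothetical load-aware cap: shrink the verification length linearly as
// server slots fill up, never going below one token.
size_t verify_cap(size_t n_max, size_t n_busy, size_t n_slots) {
    if (n_slots == 0 || n_max == 0) {
        return n_max;
    }
    if (n_busy > n_slots) {
        n_busy = n_slots;
    }
    return n_max - (n_max - 1) * n_busy / n_slots;
}
```

For example, with five draft tokens that each have top probability 0.8, p_min = 0.75 and a survival threshold of 0.5, every token passes the per-token check, but the running product falls to about 0.41 at the fourth token, so only three tokens are kept.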

Feature 2 - SAR Draft Model Support:

This follows the same pathway as EAGLE-3 (draft-eagle3):

  • Add COMMON_SPECULATIVE_TYPE_DRAFT_SAR to the common_speculative_type enum.
  • Implement a new speculative backend in common/speculative.cpp for the SAR two-pass drafting loop: (a) run parallel backbone in one forward pass to get a full draft block; (b) run the lightweight sequential refinement module over the block to refine token predictions with intra-block conditioning.
  • Extend convert_hf_to_gguf.py to support conversion of SAR draft head checkpoints from the DeepSpec format to GGUF.
  • The SAR draft head architecture and open-source training code are available at: https://github.com/deepseek-ai/DeepSpec
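
The two-pass drafting loop described above could be shaped roughly as follows. This is a sketch under stated assumptions: `sar_drafter`, `propose_block`, `refine`, and `sar_draft` are hypothetical names that exist in neither llama.cpp nor DeepSpec, and the left-to-right refinement order conditioned on the already-refined prefix is an assumption about how the sequential module introduces intra-block dependencies.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using sar_token = int32_t;

// Hypothetical interface for a SAR draft model (illustration only).
struct sar_drafter {
    // Pass (a): the parallel backbone proposes a block of n tokens in one forward pass.
    std::function<std::vector<sar_token>(const std::vector<sar_token> & ctx, size_t n)> propose_block;
    // Pass (b): the refinement module re-predicts one position, conditioned on the
    // context, the already-refined tokens of this block, and the backbone's proposal.
    std::function<sar_token(const std::vector<sar_token> & ctx,
                            const std::vector<sar_token> & refined_prefix,
                            sar_token proposed)> refine;
};

// Run both passes and return the refined draft block.
std::vector<sar_token> sar_draft(const sar_drafter & d, const std::vector<sar_token> & ctx, size_t n) {
    const std::vector<sar_token> block = d.propose_block(ctx, n);

    std::vector<sar_token> refined;
    refined.reserve(block.size());
    for (sar_token proposed : block) {
        // Each refined token sees the refined tokens before it in the same block.
        refined.push_back(d.refine(ctx, refined, proposed));
    }
    return refined;
}
```

Keeping the two passes behind separate callbacks means the backbone can stay a single batched decode while the refinement cost scales with block length, which is the trade-off a draft-sar backend would need to measure.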

Reference: DSpark paper
Related llama.cpp work: EAGLE-3 (#18039), MTP heads.
