Skip to content

fix(context): draft model fit vs load inconsistency#25056

Open
wadealexc wants to merge 2 commits into
ggml-org:masterfrom
wadealexc:fix/draft-load-context
Open

fix(context): draft model fit vs load inconsistency#25056
wadealexc wants to merge 2 commits into
ggml-org:masterfrom
wadealexc:fix/draft-load-context

Conversation

@wadealexc

@wadealexc wadealexc commented Jun 26, 2026

Copy link
Copy Markdown

Overview

This PR fixes an inconsistency between draft model fit and load. On master, draft models are fit to VRAM using params_dft.n_outputs_max = params_base.n_parallel. However, when we actually load the draft model, we don't do this, leading n_outputs_max to inherit the much larger value calculated by server_n_outputs_max. This leads to a slight over-allocation for the draft model.

Note: this PR could be a one-line fix, but after reviewing the code, I felt the root cause of the oversight was that draft/mtp fit calculations and model/context load are spread across the load_model method, leading to several inconsistencies. To fully address the underlying issue, this PR has 2 commits:

  • the first commit fixes the fit/load bug
  • the second commit refactors draft/mtp parameter initialization and load (see details below)

I split these into 2 commits for ease of review. If you want me to remove the second commit, let me know - but I think the refactor makes load_model much cleaner 😄

Additional information

The second commit unifies draft/mtp param initialization, and model and context loading.

Draft/MTP model/context are now initialized almost identically to the main model - a constructor that uses a pimpl and exposes model/context via accessor methods. server_context_impl has been changed so that model_dft and ctx_dft are raw pointers, like the main model.

I also added a helper for common_params initialization that captures the behavior spread across a few different places in load_model and provides a single method that should handle all initialization cases correctly. (see common_base_params_to_speculative). This single method also sets n_outputs_max, so all locations receive the corrected value.

One more tweak: when common_speculative_init fails, we now reset the model as well as the context (master only resets the context).

Requirements

  • I have read and agree with the contributing guidelines: Yes
  • AI usage disclosure: Yes:
    • I used Qwen 3.5 27B to help me find/correct all of the locations in server-context.cpp where model_dft and ctx_dft were referenced (they're now raw pointers)
    • I used Claude to help review my fixes and investigate the n_outputs_max discrepancy.

@wadealexc wadealexc requested review from a team as code owners June 26, 2026 15:47
@ggerganov ggerganov self-assigned this Jun 27, 2026
…d context load

- moves speculative init to speculative.cpp
- changes server_context_impl model_dft and ctx_dft to use raw pointers
@wadealexc wadealexc force-pushed the fix/draft-load-context branch from fedef14 to e985236 Compare June 27, 2026 15:02
@wadealexc

Copy link
Copy Markdown
Author

Rebased. I was thinking about changing common_init_speculative to be called common_init_speculative_from_params to better match the main model method (and distinguish from common_speculative_init), but I don't want to step on your review if you're working on this @ggerganov

@ggerganov

Copy link
Copy Markdown
Member

No problem - I'll probably review this in about a week, so feel free to make changes until then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants