fix(context): draft model fit vs load inconsistency by wadealexc · Pull Request #25056 · ggml-org/llama.cpp

wadealexc · 2026-06-26T15:47:06Z

Overview

This PR fixes an inconsistency between draft model fit and load. On master, draft models are fit to VRAM using params_dft.n_outputs_max = params_base.n_parallel. However, when we actually load the draft model, we don't do this, leading n_outputs_max to inherit the much larger value calculated by server_n_outputs_max. This leads to a slight over-allocation for the draft model.

Note: this PR could be a one-line fix, but after reviewing the code, I felt the root cause of the oversight was that draft/mtp fit calculations and model/context load are spread across the load_model method, leading to several inconsistencies. To fully address the underlying issue, this PR has 2 commits:

the first commit fixes the fit/load bug
the second commit refactors draft/mtp parameter initialization and load (see details below)

I split these into 2 commits for ease of review. If you want me to remove the second commit, let me know - but I think the refactor makes load_model much cleaner 😄

Additional information

The second commit unifies draft/mtp param initialization, and model and context loading.

Draft/MTP model/context are now initialized almost identically to the main model - a constructor that uses a pimpl and exposes model/context via accessor methods. server_context_impl has been changed so that model_dft and ctx_dft are raw pointers, like the main model.

I also added a helper for common_params initialization that captures the behavior spread across a few different places in load_model and provides a single method that should handle all initialization cases correctly. (see common_base_params_to_speculative). This single method also sets n_outputs_max, so all locations receive the corrected value.

One more tweak: when common_speculative_init fails, we now reset the model as well as the context (master only resets the context).

Requirements

I have read and agree with the contributing guidelines: Yes
AI usage disclosure: Yes:
- I used Qwen 3.5 27B to help me find/correct all of the locations in server-context.cpp where model_dft and ctx_dft were referenced (they're now raw pointers)
- I used Claude to help review my fixes and investigate the n_outputs_max discrepancy.

…d context load - moves speculative init to speculative.cpp - changes server_context_impl model_dft and ctx_dft to use raw pointers

wadealexc · 2026-06-27T15:04:33Z

Rebased. I was thinking about changing common_init_speculative to be called common_init_speculative_from_params to better match the main model method (and distinguish from common_speculative_init), but I don't want to step on your review if you're working on this @ggerganov

ggerganov · 2026-06-27T15:31:00Z

No problem - I'll probably review this in about a week, so feel free to make changes until then.

wadealexc requested review from a team as code owners June 26, 2026 15:47

github-actions Bot added the server label Jun 26, 2026

ggerganov self-assigned this Jun 27, 2026

wadealexc added 2 commits June 27, 2026 11:01

fix: draft model fit vs load inconsistency

aa2d976

refactor(server): unify draft/mtp parameter initialization, model, an…

e985236

…d context load - moves speculative init to speculative.cpp - changes server_context_impl model_dft and ctx_dft to use raw pointers

wadealexc force-pushed the fix/draft-load-context branch from fedef14 to e985236 Compare June 27, 2026 15:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(context): draft model fit vs load inconsistency#25056

fix(context): draft model fit vs load inconsistency#25056
wadealexc wants to merge 2 commits into
ggml-org:masterfrom
wadealexc:fix/draft-load-context

wadealexc commented Jun 26, 2026 •

edited

Loading

Uh oh!

wadealexc commented Jun 27, 2026

Uh oh!

ggerganov commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

wadealexc commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Additional information

Requirements

Uh oh!

wadealexc commented Jun 27, 2026

Uh oh!

ggerganov commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wadealexc commented Jun 26, 2026 •

edited

Loading