feat: add gemma4-assistant arch — Gemma-4 MTP drafter for --spec-type draft-mtp#32
Open
lalalune wants to merge 7 commits into
…ther seam

- LLM_ARCH_GEMMA4_ASSISTANT enum + arch name + nextn/masked tensor enums/names/infos
- hparams n_embd_inp_impl + n_embd_inp() override
- model fields nextn_proj_pre/post + models.h struct decl
- llm_graph_result t_h_nextn + get_h_nextn
- src/models/gemma4-assistant.cpp (ported from upstream, fork naming)
- ctx_other on cparams + llama_context_params + llama_get_ctx_other
- llama_memory_params.mem_other + layer_share_cb

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-type

- llama_kv_cache / llama_kv_cache_iswa ctors take mem_other + share
- per-layer share() maps drafter layers to sibling target KV layers
- all existing callers pass nullptr (additive, behavior unchanged)
- model factory dispatch + NEOX rope-type for gemma4-assistant
- create_memory builds share lambda + mem_other for gemma4-assistant

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
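The per-layer share idea in the commit above can be pictured with a small sketch. Everything here is hypothetical: `sketch_share_fn`, `sketch_kv_layer` and `sketch_plan_layers` are illustration-only names, and the fork's real `layer_share_cb` signature and layer mapping are not shown in this PR text. The only behavior taken from the source is that a null callback leaves every layer owning its own storage.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical callback shape: given a drafter layer index, return the target
// layer whose KV storage it should reuse, or -1 to allocate its own.
using sketch_share_fn = std::function<int32_t(int32_t)>;

struct sketch_kv_layer {
    int32_t il;          // layer index in this cache
    int32_t shared_from; // target layer providing K/V, or -1 if owned
};

// Plan the layers of a cache. A null `share` reproduces the pre-existing
// behavior: every layer owns its storage.
std::vector<sketch_kv_layer> sketch_plan_layers(int32_t n_layer,
                                                int32_t n_layer_other,
                                                const sketch_share_fn & share) {
    std::vector<sketch_kv_layer> out;
    for (int32_t il = 0; il < n_layer; ++il) {
        const int32_t src = share ? share(il) : -1;
        assert(src < n_layer_other); // a shared layer must exist in the other cache
        out.push_back({il, src});
    }
    return out;
}
```

Calling `sketch_plan_layers(2, 0, nullptr)` marks both layers as owned, which mirrors the "all existing callers pass nullptr" line; passing a lambda such as `[](int32_t il) { return il + 2; }` is one made-up mapping onto a four-layer target.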
server-context sets cparams.ctx_other = ctx_tgt on both MTP draft-context branches so the gemma4-assistant drafter can read the target's token embeddings and share its KV cache. Harmless for other draft arches (only that arch reads it).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…edup

Three correctness fixes that take draft acceptance from ~2% to ~41%:

1. MTP runtime row width = n_embd_out (backbone 1536), not the drafter's internal n_embd (256); assert it matches the target hidden width.
2. pre-norm extraction (context.cpp) + get_embeddings_pre_norm_ith + embd_pre_norm buffer now sized/strided by n_embd_out, so the n_embd_out-wide nextn hidden survives the round trip.
3. target gemma4 graph now exposes its POST-final-norm hidden via t_h_pre_norm (the LM-head input feature the drafter consumes), matching upstream's t_h_nextn; gemma4-assistant routes its nextn hidden through the same seam.

Result on M4 Max Metal: 133.8 t/s with draft-mtp vs 109.5 t/s baseline (1.22x).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
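To see why the row width matters in fixes 1 and 2, here is a minimal sketch of a row-major hidden-state buffer strided by `n_embd_out`. The type `sketch_pre_norm_buf` and the helper `sketch_row_offset` are invented for illustration and are not the fork's `embd_pre_norm` code; only the two widths (256 and 1536) come from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustration-only buffer holding one hidden-state row per output token.
struct sketch_pre_norm_buf {
    std::size_t n_embd_out;  // row width, e.g. the backbone width 1536
    std::vector<float> data; // n_rows * n_embd_out floats, row-major

    sketch_pre_norm_buf(std::size_t n_rows, std::size_t n_embd_out_)
        : n_embd_out(n_embd_out_), data(n_rows * n_embd_out_, 0.0f) {}

    // Pointer to the start of row i, strided by n_embd_out.
    float * row(std::size_t i) {
        assert((i + 1) * n_embd_out <= data.size());
        return data.data() + i * n_embd_out;
    }
};

// Offset (in floats) of row i for a given row width.
inline std::size_t sketch_row_offset(std::size_t i, std::size_t width) {
    return i * width;
}
```

With a width of 1536, row 2 starts at float offset 3072. Striding the same buffer by 256 would put "row 2" at offset 512, which is still inside row 0, so the drafter would read a slice of the wrong token's hidden state.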
embd_pre_norm.size now derives from n_embd_out; the n_embd local is dead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
lalalune added a commit to elizaOS/eliza that referenced this pull request Jun 24, 2026
…support (#9268)

Points the fork submodule at the gemma4-assistant port (validated: loads Google's Gemma-4 MTP drafter + ~1.1x decode speedup via --spec-type draft-mtp on M4 Max Metal). bcae29e65 (prior gitlink) is a clean ancestor, so this is a fast-forward that adds the metal-tbq attn-score fix + the gemma4-assistant arch. Tracks fork PR elizaOS/llama.cpp#32; re-point to the merged commit once that lands.

This is the runtime half of the Gemma-4 MTP drafter work: with this, the fused engine built from the fork can load mtp/drafter-<tier>.gguf (the amaranus/Google gemma-4-E2B-it-assistant head, wired in #9256) and run separate-drafter MTP.

Co-authored-by: Shaw <shawgotbags@gmail.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…ant)
The fused engine builds its MTP draft context (eliza-inference-ffi.cpp) without
ctx_other, so a separate-drafter MTP arch that reaches into the target context
(gemma4-assistant: target tok_embd + hidden) hard-failed in the llama-context
ctor ("Gemma4Assistant requires ctx_other to be set"). server-context.cpp set it;
the fused FFI path did not. Set cp.ctx_other = e->ctx_tgt before init (inert for
same-file MTP archs that don't consult ctx_other).
Validated via the fused FFI on M4 Max Metal: loads the amaranus gemma4-assistant
drafter + runs MTP — drafted 28 / accepted 12 (43%), coherent output, 112 tok/s.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
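The failure mode described in this commit can be sketched as a null check at context creation. This is a toy model under stated assumptions: `sketch_ctx`, `sketch_ctx_params`, `needs_ctx_other` and `sketch_init_draft_ctx` are hypothetical names, and only the error string and the "inert for archs that don't consult ctx_other" behavior are taken from the message above.

```cpp
#include <string>

struct sketch_ctx {}; // stand-in for a llama context

struct sketch_ctx_params {
    bool needs_ctx_other = false;           // hypothetical: true for an arch like gemma4-assistant
    const sketch_ctx * ctx_other = nullptr; // target context; null by default
};

// Returns an empty string on success, or the error message on failure.
std::string sketch_init_draft_ctx(const sketch_ctx_params & cp) {
    if (cp.needs_ctx_other && cp.ctx_other == nullptr) {
        return "Gemma4Assistant requires ctx_other to be set";
    }
    return ""; // archs that never read ctx_other are unaffected by its value
}
```

Setting `cp.ctx_other = &tgt` before calling `sketch_init_draft_ctx` corresponds to the one-line fix; an arch with `needs_ctx_other == false` succeeds whether or not the pointer is set.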
lalalune added a commit to elizaOS/eliza that referenced this pull request Jun 24, 2026
… fused engine (#9273)

Follow-up to the gemma4-assistant arch bump: the fused engine (libelizainference, the actual eliza runtime) built its MTP draft context without ctx_other, so a gemma4-assistant drafter hard-failed there ("requires ctx_other"). Fixed in the fork (elizaOS/llama.cpp#32): set cp.ctx_other = ctx_tgt before init.

Validated end-to-end via the fused FFI on M4 Max Metal: the fused engine loads the amaranus gemma4-assistant drafter + runs separate-drafter MTP (drafted 28 / accepted 12 = 43%, coherent output, 112 tok/s). This is the runtime path the plugin uses, so the Gemma-4 MTP drafter now works in eliza, not just llama-cli.

Co-authored-by: Shaw <shawgotbags@gmail.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
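The reported figures can be checked with two one-line formulas. The helpers below are illustrative, not part of the fork; the inputs are the numbers quoted in this PR (28 drafted / 12 accepted here, and 133.8 vs 109.5 t/s in the earlier commit).

```cpp
#include <cmath>

// Share of drafted tokens the target accepted, rounded to a whole percent.
int sketch_acceptance_pct(int drafted, int accepted) {
    if (drafted <= 0) {
        return 0;
    }
    return static_cast<int>(std::lround(100.0 * accepted / drafted));
}

// Decode speedup of a speculative run over a baseline run (tokens per second).
double sketch_speedup(double tps_spec, double tps_base) {
    return tps_spec / tps_base;
}
```

`sketch_acceptance_pct(28, 12)` gives 43, and `sketch_speedup(133.8, 109.5)` is about 1.22, matching the values stated in the commit messages.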
…ch IFGO

The fused pyannote-segmentation-3.0 diarizer read the BiLSTM gates as PyTorch I,F,G,O while the GGUF actually packs them in ONNX I,O,F,C order. The scrambled forget/output/cell gates made the diarizer over-detect overlap and hallucinate speakers on inputs near the decision boundary (single-speaker golden-stt → 3 speakers, DER 0.79); it slipped past the small parity-fixture suite because those inputs argmax to the right label anyway.

Verified against the reference ONNX (onnx-community/pyannote-segmentation-3.0): with IOFC the C LSTM-0 output now matches the reference to float precision, and golden-stt diarizes to 1 speaker / 0 overlap (DER ~0), agreeing with the reference across single-speaker, silence, and mixed windows.

Refs elizaOS/eliza#9460.
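The two gate layouts differ only in the order of four equally sized blocks, which a short sketch makes concrete. This is not the diarizer's code (the commit fixes the reader to use IOFC directly); `sketch_iofc_to_ifgo` is a hypothetical helper that assumes a single-direction weight tensor of shape [4*hidden, cols] stored row-major.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Reorder a packed LSTM gate tensor from ONNX order (i, o, f, c) to PyTorch
// order (i, f, g, o). ONNX's "c" is the cell-candidate gate PyTorch calls "g".
std::vector<float> sketch_iofc_to_ifgo(const std::vector<float> & w,
                                       std::size_t hidden,
                                       std::size_t cols) {
    assert(w.size() == 4 * hidden * cols);
    // src_block[k] = index of the ONNX block that supplies PyTorch block k.
    const std::array<std::size_t, 4> src_block = {0, 2, 3, 1};
    const std::size_t block = hidden * cols;
    std::vector<float> out(w.size());
    for (std::size_t k = 0; k < 4; ++k) {
        for (std::size_t j = 0; j < block; ++j) {
            out[k * block + j] = w[src_block[k] * block + j];
        }
    }
    return out;
}
```

Reading IOFC-packed data as if it were IFGO leaves the input gate in place but swaps the other three, which is consistent with the "scrambled forget/output/cell gates" symptom described above.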
Ports the upstream `gemma4-assistant` model architecture into the fork so it can load + run Google's official Gemma-4 MTP drafter head (e.g. `amaranus/Gemma-4-E2B-it-qat-assistant-MTP-Q8_0`) via `--spec-type draft-mtp`.

Validated (Apple M4 Max, Metal)

- … `gemma4-assistant` cleanly (49 drafter tensors map; no unknown-arch / missing-tensor errors).

Additive

`mem_other`/`share`/`ctx_other` default to null for every existing arch (all existing callers pass null), so non-assistant behavior is unchanged. The only shared change: the target gemma4 graph now also publishes its post-final-norm hidden via `t_h_pre_norm` (aliases the already-output `t_embd`; free + inert unless `embeddings_pre_norm` is enabled).

Beyond a verbatim port (correctness fixes in the fork's MTP path)

- … `n_embd` (256) instead of `n_embd_out` (backbone 1536); fixed in `speculative.cpp` + the three `embeddings_pre_norm` width/stride sites.
- … `res->t_h_pre_norm = cur` (acceptance ~8% → ~41%).
- … `swa_full=true` override on SWA draft contexts oversized the drafter's SWA cache vs the shared target and crashed `get_k`; gated off for `spec_mtp`.

Plus KV-cache ctor `mem_other`/`share` seam (preserving the fork's trailing `kv_size_max`), a `layer_share_cb` typedef, and `ctx_other=ctx_tgt` wiring for draft-mtp contexts. 22 files, +419/−26, builds clean.

🤖 Generated with Claude Code
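The `swa_full` gating mentioned in the description can be read as a small decision rule. The sketch below is an assumption about its shape, not the fork's code: `sketch_draft_opts` and `sketch_effective_swa_full` are hypothetical, and the rule (force `swa_full` on for SWA draft contexts unless the context is a `spec_mtp` one that shares the target's cache) is inferred from the one-line summary only.

```cpp
// Hypothetical inputs, named after terms used in the description.
struct sketch_draft_opts {
    bool is_swa_model = false; // draft model uses sliding-window attention
    bool spec_mtp     = false; // draft context shares the target's KV cache
    bool swa_full     = false; // value requested by the caller
};

// Effective swa_full for a draft context under the assumed rule: SWA draft
// contexts get swa_full forced on, except under spec_mtp, where the caller's
// value is passed through so the drafter's cache is not sized differently
// from the shared target cache.
bool sketch_effective_swa_full(const sketch_draft_opts & o) {
    if (o.is_swa_model && !o.spec_mtp) {
        return true;
    }
    return o.swa_full;
}
```

Under this reading, a non-MTP SWA drafter keeps the old override, while a `spec_mtp` drafter simply inherits whatever was requested.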