feat: add gemma4-assistant arch — Gemma-4 MTP drafter for --spec-type draft-mtp by lalalune · Pull Request #32 · elizaOS/llama.cpp

lalalune · 2026-06-24T06:49:30Z

Ports the upstream gemma4-assistant model architecture into the fork so it can load + run Google's official Gemma-4 MTP drafter head (e.g. amaranus/Gemma-4-E2B-it-qat-assistant-MTP-Q8_0) via --spec-type draft-mtp.

Validated (Apple M4 Max, Metal)

Loads gemma4-assistant cleanly (49 drafter tensors map; no unknown-arch / missing-tensor errors).
Coherent/correct output (greedy parity with baseline).
Draft acceptance ~41–49%; ~1.1× decode speedup (n_max 2, multi-run median; matches upstream's ~1.29× on a less-loaded host).

Additive

mem_other / share / ctx_other default to null for every existing arch (all existing callers pass null), so non-assistant behavior is unchanged. The only shared change: the target gemma4 graph now also publishes its post-final-norm hidden via t_h_pre_norm (aliases the already-output t_embd — free + inert unless embeddings_pre_norm is enabled).
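Why the shared change is "free + inert" can be pictured with a minimal sketch. Only the member names `t_embd` and `t_h_pre_norm` and the `embeddings_pre_norm` switch come from this PR; `tensor_sketch`, `graph_result_sketch`, `publish_hidden` and `extract_pre_norm` are hypothetical stand-ins, not the fork's actual code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins; only t_embd / t_h_pre_norm are names from the PR.
struct tensor_sketch { std::vector<float> data; };

struct graph_result_sketch {
    tensor_sketch * t_embd       = nullptr;
    tensor_sketch * t_h_pre_norm = nullptr;
};

// Publishing the hidden state is a pointer copy: no new tensor, no extra compute.
inline void publish_hidden(graph_result_sketch & res, tensor_sketch * cur) {
    res.t_embd       = cur;
    res.t_h_pre_norm = cur; // alias of the tensor that is already an output
}

// The alias is only read when the caller opted in; otherwise nothing is copied.
inline std::vector<float> extract_pre_norm(const graph_result_sketch & res,
                                           bool embeddings_pre_norm) {
    if (!embeddings_pre_norm || res.t_h_pre_norm == nullptr) {
        return {};
    }
    return res.t_h_pre_norm->data;
}
```

With `embeddings_pre_norm` off, `extract_pre_norm` returns an empty vector, which is the sense in which the alias should be inert for existing users.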

Beyond a verbatim port (correctness fixes in the fork's MTP path)

MTP hidden-state rows were sized by the drafter's n_embd (256) instead of n_embd_out (backbone 1536) — fixed in speculative.cpp + the three embeddings_pre_norm width/stride sites.
The target gemma4 graph never exposed its post-final-norm hidden, so the drafter got no usable target hidden — added res->t_h_pre_norm = cur (acceptance ~8% → ~41%).
The fork's swa_full=true override on SWA draft contexts oversized the drafter's SWA cache vs the shared target and crashed get_k — gated off for spec_mtp.
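The row-width fix in the first item can be sketched as follows. The widths 256 and 1536 are the ones quoted above; `drafter_hparams_sketch` and `hidden_rows_sketch` are hypothetical types for illustration, not the structures in speculative.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical hparams subset; the widths are the ones quoted in the PR.
struct drafter_hparams_sketch {
    uint32_t n_embd     = 256;  // drafter's internal width
    uint32_t n_embd_out = 1536; // backbone (target) hidden width
};

// Buffer holding one target hidden-state row per token.
struct hidden_rows_sketch {
    uint32_t           width;
    std::vector<float> data;

    hidden_rows_sketch(const drafter_hparams_sketch & hp, uint32_t n_tokens,
                       uint32_t n_embd_target)
        : width(hp.n_embd_out),
          data(static_cast<std::size_t>(hp.n_embd_out) * n_tokens) {
        // The row width must match the target's hidden width, not hp.n_embd.
        assert(width == n_embd_target);
    }

    // Row i starts at i * width; striding by n_embd (256) would land inside row 0.
    float * row(uint32_t i) {
        return data.data() + static_cast<std::size_t>(i) * width;
    }
};
```

Sizing and striding by the same `width` is the point: a buffer allocated with one width and indexed with another silently hands the drafter the wrong floats.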

Plus KV-cache ctor mem_other/share seam (preserving the fork's trailing kv_size_max), a layer_share_cb typedef, and ctx_other=ctx_tgt wiring for draft-mtp contexts. 22 files, +419/−26, builds clean.

🤖 Generated with Claude Code

…ther seam

- LLM_ARCH_GEMMA4_ASSISTANT enum + arch name + nextn/masked tensor enums/names/infos
- hparams n_embd_inp_impl + n_embd_inp() override
- model fields nextn_proj_pre/post + models.h struct decl
- llm_graph_result t_h_nextn + get_h_nextn
- src/models/gemma4-assistant.cpp (ported from upstream, fork naming)
- ctx_other on cparams + llama_context_params + llama_get_ctx_other
- llama_memory_params.mem_other + layer_share_cb

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…-type

- llama_kv_cache / llama_kv_cache_iswa ctors take mem_other + share
- per-layer share() maps drafter layers to sibling target KV layers
- all existing callers pass nullptr (additive, behavior unchanged)
- model factory dispatch + NEOX rope-type for gemma4-assistant
- create_memory builds share lambda + mem_other for gemma4-assistant

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
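A minimal sketch of how a per-layer share callback could resolve KV layers. Everything here is an assumption made for illustration: the callback signature `layer_share_fn`, the "-1 means own cache" convention, and the `kv_cache_sketch` type are hypothetical, and the fork's real `layer_share_cb` typedef may look different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical signature; the fork's layer_share_cb typedef may differ.
// Returns the target layer whose KV a drafter layer reuses, or -1 for "own cache".
using layer_share_fn = std::function<int32_t(int32_t il_draft)>;

struct kv_layer_sketch { std::vector<float> k, v; };

struct kv_cache_sketch {
    std::vector<kv_layer_sketch>         own;             // one entry per drafter layer
    const std::vector<kv_layer_sketch> * other = nullptr; // target layers (mem_other)
    layer_share_fn                       share;           // empty when nothing is shared

    const kv_layer_sketch & layer(int32_t il) const {
        if (other != nullptr && share) {
            const int32_t il_tgt = share(il);
            if (il_tgt >= 0) {
                assert(static_cast<std::size_t>(il_tgt) < other->size());
                return (*other)[static_cast<std::size_t>(il_tgt)]; // sibling target layer, no copy
            }
        }
        return own[static_cast<std::size_t>(il)]; // null mem_other/share: behavior unchanged
    }
};
```

Which target layer a given drafter layer maps to is decided entirely by the callback; the sketch takes no position on the mapping gemma4-assistant actually uses.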

server-context sets cparams.ctx_other = ctx_tgt on both MTP draft-context branches so the gemma4-assistant drafter can read the target's token embeddings and share its KV cache. Harmless for other draft arches (only that arch reads it).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…edup

Three correctness fixes that take draft acceptance from ~2% to ~41%:

1. MTP runtime row width = n_embd_out (backbone 1536), not the drafter's internal n_embd (256); assert it matches the target hidden width.
2. pre-norm extraction (context.cpp) + get_embeddings_pre_norm_ith + embd_pre_norm buffer now sized/strided by n_embd_out, so the n_embd_out-wide nextn hidden survives the round trip.
3. target gemma4 graph now exposes its POST-final-norm hidden via t_h_pre_norm (the LM-head input feature the drafter consumes), matching upstream's t_h_nextn; gemma4-assistant routes its nextn hidden through the same seam.

Result on M4 Max Metal: 133.8 t/s with draft-mtp vs 109.5 t/s baseline (1.22x).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
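The quoted 1.22x is simply the ratio of the two throughput figures. A tiny helper makes the arithmetic explicit; `decode_speedup` is a hypothetical name, not a function in the fork.

```cpp
#include <cassert>

// Decode speedup = speculative throughput / baseline throughput (tokens per second).
inline double decode_speedup(double tps_spec, double tps_base) {
    assert(tps_base > 0.0);
    return tps_spec / tps_base;
}
```

Calling `decode_speedup(133.8, 109.5)` gives roughly 1.22, the figure reported in the commit message.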

embd_pre_norm.size now derives from n_embd_out; the n_embd local is dead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-06-24T06:49:39Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


…support (#9268)

Points the fork submodule at the gemma4-assistant port (validated: loads Google's Gemma-4 MTP drafter + ~1.1x decode speedup via --spec-type draft-mtp on M4 Max Metal). bcae29e65 (prior gitlink) is a clean ancestor, so this is a fast-forward that adds the metal-tbq attn-score fix + the gemma4-assistant arch. Tracks fork PR elizaOS/llama.cpp#32; re-point to the merged commit once that lands.

This is the runtime half of the Gemma-4 MTP drafter work: with this, the fused engine built from the fork can load mtp/drafter-<tier>.gguf (the amaranus/Google gemma-4-E2B-it-assistant head, wired in #9256) and run separate-drafter MTP.

Co-authored-by: Shaw <shawgotbags@gmail.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

…ant)

The fused engine builds its MTP draft context (eliza-inference-ffi.cpp) without ctx_other, so a separate-drafter MTP arch that reaches into the target context (gemma4-assistant: target tok_embd + hidden) hard-failed in the llama-context ctor ("Gemma4Assistant requires ctx_other to be set"). server-context.cpp set it; the fused FFI path did not. Set cp.ctx_other = e->ctx_tgt before init (inert for same-file MTP archs that don't consult ctx_other).

Validated via the fused FFI on M4 Max Metal: loads the amaranus gemma4-assistant drafter + runs MTP: drafted 28 / accepted 12 (43%), coherent output, 112 tok/s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
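The failure and the fix can be sketched in isolation. The error text and the `ctx_other` field name are taken from the commit message above; the types `arch_sketch`, `context_params_sketch`, `context_sketch` and the helper `make_draft_context` are hypothetical, and throwing an exception is an assumption about how the real ctor "hard-fails".

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins for illustration; not the fork's real types.
enum class arch_sketch { gemma4, gemma4_assistant };

struct context_sketch;

struct context_params_sketch {
    const context_sketch * ctx_other = nullptr; // null for every pre-existing caller
};

struct context_sketch {
    arch_sketch            arch;
    const context_sketch * ctx_other;

    context_sketch(arch_sketch a, const context_params_sketch & cp)
        : arch(a), ctx_other(cp.ctx_other) {
        // Only the assistant arch reaches into the target context.
        if (arch == arch_sketch::gemma4_assistant && ctx_other == nullptr) {
            throw std::runtime_error("Gemma4Assistant requires ctx_other to be set");
        }
    }
};

// The fix: wire the target context into the params before building the draft context.
inline context_sketch make_draft_context(arch_sketch a, const context_sketch & ctx_tgt) {
    context_params_sketch cp;
    cp.ctx_other = &ctx_tgt; // ignored by archs that never read it
    return context_sketch(a, cp);
}
```

Setting the pointer unconditionally keeps the call site simple, since archs that do not consult `ctx_other` never look at it.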

… fused engine (#9273)

Follow-up to the gemma4-assistant arch bump: the fused engine (libelizainference, the actual eliza runtime) built its MTP draft context without ctx_other, so a gemma4-assistant drafter hard-failed there ("requires ctx_other"). Fixed in the fork (elizaOS/llama.cpp#32): set cp.ctx_other = ctx_tgt before init.

Validated end-to-end via the fused FFI on M4 Max Metal: the fused engine loads the amaranus gemma4-assistant drafter + runs separate-drafter MTP (drafted 28 / accepted 12 = 43%, coherent output, 112 tok/s). This is the runtime path the plugin uses, so the Gemma-4 MTP drafter now works in eliza, not just llama-cli.

Co-authored-by: Shaw <shawgotbags@gmail.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
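The 43% figure is accepted drafts over drafted tokens. A small helper, with the hypothetical name `acceptance_rate`, spells that out:

```cpp
#include <cassert>

// Draft acceptance = accepted draft tokens / drafted tokens.
inline double acceptance_rate(int n_accepted, int n_drafted) {
    assert(n_drafted > 0 && n_accepted >= 0 && n_accepted <= n_drafted);
    return static_cast<double>(n_accepted) / n_drafted;
}
```

With the numbers above, `acceptance_rate(12, 28)` is about 0.43, consistent with the ~41–49% range reported in the PR description.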

…ch IFGO

The fused pyannote-segmentation-3.0 diarizer read the BiLSTM gates as PyTorch I,F,G,O while the GGUF actually packs them in ONNX I,O,F,C order. The scrambled forget/output/cell gates made the diarizer over-detect overlap and hallucinate speakers on inputs near the decision boundary (single-speaker golden-stt → 3 speakers, DER 0.79); it slipped past the small parity-fixture suite because those inputs argmax to the right label anyway.

Verified against the reference ONNX (onnx-community/pyannote-segmentation-3.0): with IOFC the C LSTM-0 output now matches the reference to float precision, and golden-stt diarizes to 1 speaker / 0 overlap (DER ~0), agreeing with the reference across single-speaker, silence, and mixed windows.

Refs elizaOS/eliza#9460.
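The gate-order mismatch is easiest to see as a block permutation. The sketch below repacks a stacked gate matrix from ONNX I,O,F,C order into PyTorch I,F,G,O order (PyTorch's G is ONNX's C). This is an illustration only: the function name and layout assumptions (four equally sized row-major blocks stacked along the first axis) are mine, and the actual fix may index the gates in IOFC order directly rather than repacking.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// kOnnxBlockForTorchSlot[s] = which ONNX block (I=0, O=1, F=2, C=3)
// fills PyTorch slot s (I=0, F=1, G=2, O=3).
constexpr std::array<std::size_t, 4> kOnnxBlockForTorchSlot = {0, 2, 3, 1};

// Reorder a [4*hidden, cols] row-major matrix from ONNX IOFC to PyTorch IFGO.
inline std::vector<float> iofc_to_ifgo(const std::vector<float> & w,
                                       std::size_t hidden, std::size_t cols) {
    assert(w.size() == 4 * hidden * cols);
    const std::ptrdiff_t block = static_cast<std::ptrdiff_t>(hidden * cols);
    std::vector<float> out(w.size());
    for (std::size_t slot = 0; slot < 4; ++slot) {
        const std::ptrdiff_t src = static_cast<std::ptrdiff_t>(kOnnxBlockForTorchSlot[slot]);
        const std::ptrdiff_t dst = static_cast<std::ptrdiff_t>(slot);
        std::copy(w.begin() + src * block,
                  w.begin() + (src + 1) * block,
                  out.begin() + dst * block);
    }
    return out;
}
```

Reading IOFC data as IFGO swaps the forget, cell and output gates while leaving the input gate in place, which fits the description of outputs that are wrong yet often still argmax to the right label.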

lalalune and others added 5 commits June 23, 2026 23:28

chore(gemma4-assistant): drop now-unused n_embd local in output_reserve

2323fa8

embd_pre_norm.size now derives from n_embd_out; the n_embd local is dead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

lalalune mentioned this pull request Jun 24, 2026

chore(local-inference): bump llama.cpp fork to gemma4-assistant arch support elizaOS/eliza#9268

Merged

github-actions bot added the examples, server, and model labels on Jun 24, 2026

lalalune mentioned this pull request Jun 24, 2026

chore(local-inference): bump fork — gemma4-assistant MTP works in the fused engine elizaOS/eliza#9273

Merged

lalalune wants to merge 7 commits into fix/metal-tbq3-tbq4-attn-score from feat/gemma4-assistant-arch