
feat: add gemma4-assistant arch — Gemma-4 MTP drafter for --spec-type draft-mtp#32

Open
lalalune wants to merge 7 commits into fix/metal-tbq3-tbq4-attn-score from feat/gemma4-assistant-arch


@lalalune

Member

Ports the upstream gemma4-assistant model architecture into the fork so it can load and run Google's official Gemma-4 MTP drafter head (e.g. amaranus/Gemma-4-E2B-it-qat-assistant-MTP-Q8_0) via --spec-type draft-mtp.

Validated (Apple M4 Max, Metal)

  • Loads gemma4-assistant cleanly (49 drafter tensors map; no unknown-arch / missing-tensor errors).
  • Coherent/correct output (greedy parity with baseline).
  • Draft acceptance ~41–49%; ~1.1× decode speedup (n_max 2, multi-run median; matches upstream's ~1.29× on a less-loaded host).

Additive

mem_other, share, and ctx_other default to null for every existing arch (all existing callers pass null), so non-assistant behavior is unchanged. The only shared change: the target gemma4 graph now also publishes its post-final-norm hidden state via t_h_pre_norm (it aliases the already-output t_embd, so it is free and inert unless embeddings_pre_norm is enabled).

Beyond a verbatim port (correctness fixes in the fork's MTP path)

  1. MTP hidden-state rows were sized by the drafter's n_embd (256) instead of n_embd_out (backbone 1536) — fixed in speculative.cpp + the three embeddings_pre_norm width/stride sites.
  2. The target gemma4 graph never exposed its post-final-norm hidden, so the drafter got no usable target hidden — added res->t_h_pre_norm = cur (acceptance ~8% → ~41%).
  3. The fork's swa_full=true override on SWA draft contexts oversized the drafter's SWA cache vs the shared target and crashed get_k — gated off for spec_mtp.
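
The width mismatch in item 1 can be illustrated with a minimal sketch. `hidden_row` and `make_rows` are hypothetical helpers written for this example, not functions from the fork; only the two widths (256 and 1536) come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: pointer to row i of a row-major [n_rows x row_width] buffer.
inline const float * hidden_row(const std::vector<float> & buf, size_t row_width, size_t i) {
    assert((i + 1) * row_width <= buf.size()); // the row must lie inside the buffer
    return buf.data() + i * row_width;
}

// Hypothetical helper: every element of row r holds the value r,
// so a read with the wrong stride is easy to spot.
inline std::vector<float> make_rows(size_t n_rows, size_t row_width) {
    std::vector<float> buf(n_rows * row_width);
    for (size_t r = 0; r < n_rows; ++r) {
        for (size_t c = 0; c < row_width; ++c) {
            buf[r * row_width + c] = static_cast<float>(r);
        }
    }
    return buf;
}
```

With rows written at width 1536, asking for row 2 at stride 256 starts at element 512, which is still inside row 0. The read does not fault; it silently returns the wrong token's hidden state, which would be consistent with a low acceptance rate rather than a crash.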

Plus KV-cache ctor mem_other/share seam (preserving the fork's trailing kv_size_max), a layer_share_cb typedef, and ctx_other=ctx_tgt wiring for draft-mtp contexts. 22 files, +419/−26, builds clean.
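
A rough sketch of how a nullable per-layer share callback can stay additive. The `layer_share_cb` signature, `kv_layer`, and `build_layers` here are assumptions made for illustration and are not the fork's actual declarations.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Assumed shape of the callback: given a drafter layer index, return the
// target layer whose KV cache it reuses, or -1 to keep its own cache.
using layer_share_cb = std::function<int32_t(int32_t il)>;

// Hypothetical record of which layer owns the KV storage for a slot.
struct kv_layer {
    int32_t owner_layer;
    bool    shared;
};

// Hypothetical builder: a null callback reproduces the pre-existing
// behavior, where every layer owns its own cache.
inline std::vector<kv_layer> build_layers(int32_t n_layer, const layer_share_cb & share) {
    assert(n_layer >= 0);
    std::vector<kv_layer> out;
    for (int32_t il = 0; il < n_layer; ++il) {
        const int32_t tgt = share ? share(il) : -1;
        out.push_back(tgt >= 0 ? kv_layer{tgt, true} : kv_layer{il, false});
    }
    return out;
}
```

Existing callers would pass `nullptr` and see no change; only a drafter arch supplies a callback that redirects its layers to sibling target layers.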

🤖 Generated with Claude Code

lalalune and others added 5 commits June 23, 2026 23:28
…ther seam

- LLM_ARCH_GEMMA4_ASSISTANT enum + arch name + nextn/masked tensor enums/names/infos
- hparams n_embd_inp_impl + n_embd_inp() override
- model fields nextn_proj_pre/post + models.h struct decl
- llm_graph_result t_h_nextn + get_h_nextn
- src/models/gemma4-assistant.cpp (ported from upstream, fork naming)
- ctx_other on cparams + llama_context_params + llama_get_ctx_other
- llama_memory_params.mem_other + layer_share_cb

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…-type

- llama_kv_cache / llama_kv_cache_iswa ctors take mem_other + share
- per-layer share() maps drafter layers to sibling target KV layers
- all existing callers pass nullptr (additive, behavior unchanged)
- model factory dispatch + NEOX rope-type for gemma4-assistant
- create_memory builds share lambda + mem_other for gemma4-assistant

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

server-context sets cparams.ctx_other = ctx_tgt on both MTP draft-context
branches so the gemma4-assistant drafter can read the target's token embeddings
and share its KV cache. Harmless for other draft arches (only that arch reads it).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…edup

Three correctness fixes that take draft acceptance from ~2% to ~41%:
1. MTP runtime row width = n_embd_out (backbone 1536), not the drafter's
   internal n_embd (256); assert it matches the target hidden width.
2. pre-norm extraction (context.cpp) + get_embeddings_pre_norm_ith +
   embd_pre_norm buffer now sized/strided by n_embd_out, so the n_embd_out-wide
   nextn hidden survives the round trip.
3. target gemma4 graph now exposes its POST-final-norm hidden via t_h_pre_norm
   (the LM-head input feature the drafter consumes), matching upstream's
   t_h_nextn; gemma4-assistant routes its nextn hidden through the same seam.

Result on M4 Max Metal: 133.8 t/s with draft-mtp vs 109.5 t/s baseline (1.22x).
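
As a quick check of the quoted ratio, using only the two throughput figures above:

$$
\text{speedup} = \frac{133.8\ \text{t/s}}{109.5\ \text{t/s}} \approx 1.22
$$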

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

embd_pre_norm.size now derives from n_embd_out; the n_embd local is dead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

lalalune added a commit to elizaOS/eliza that referenced this pull request Jun 24, 2026
…support (#9268)

Points the fork submodule at the gemma4-assistant port (validated: loads Google's
Gemma-4 MTP drafter + ~1.1x decode speedup via --spec-type draft-mtp on M4 Max
Metal). bcae29e65 (prior gitlink) is a clean ancestor, so this is a fast-forward
that adds the metal-tbq attn-score fix + the gemma4-assistant arch. Tracks fork
PR elizaOS/llama.cpp#32; re-point to the merged commit once that lands.

This is the runtime half of the Gemma-4 MTP drafter work — with this, the fused
engine built from the fork can load mtp/drafter-<tier>.gguf (the amaranus/Google
gemma-4-E2B-it-assistant head, wired in #9256) and run separate-drafter MTP.

Co-authored-by: Shaw <shawgotbags@gmail.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

…ant)

The fused engine builds its MTP draft context (eliza-inference-ffi.cpp) without
ctx_other, so a separate-drafter MTP arch that reaches into the target context
(gemma4-assistant: target tok_embd + hidden) hard-failed in the llama-context
ctor ("Gemma4Assistant requires ctx_other to be set"). server-context.cpp set it;
the fused FFI path did not. Set cp.ctx_other = e->ctx_tgt before init (inert for
same-file MTP archs that don't consult ctx_other).
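
The failure mode and the fix can be sketched as follows. The types below are simplified, hypothetical stand-ins (including the `needs_ctx_other` flag and the use of an exception to model the hard failure); only the field name `ctx_other` and the error text are taken from the message above.

```cpp
#include <cassert>
#include <stdexcept>

struct context; // forward declaration for the pointer below

// Hypothetical, simplified context parameters.
struct context_params {
    bool            needs_ctx_other = false;   // true for an arch like gemma4-assistant
    const context * ctx_other       = nullptr; // target context the drafter reaches into
};

// Hypothetical context whose constructor rejects a missing ctx_other.
struct context {
    const context * other;
    explicit context(const context_params & cp) : other(cp.ctx_other) {
        if (cp.needs_ctx_other && cp.ctx_other == nullptr) {
            throw std::runtime_error("Gemma4Assistant requires ctx_other to be set");
        }
    }
};

// Mirrors the fix: set ctx_other before init, for every draft arch.
inline context make_draft_context(const context & ctx_tgt, bool needs_ctx_other) {
    context_params cp;
    cp.needs_ctx_other = needs_ctx_other;
    cp.ctx_other       = &ctx_tgt; // inert when the arch never reads it
    context ctx(cp);
    assert(ctx.other == &ctx_tgt);
    return ctx;
}
```

Setting the pointer unconditionally keeps the call site free of per-arch branching, at the cost of carrying an unused pointer for archs that do not consult it.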

Validated via the fused FFI on M4 Max Metal: loads the amaranus gemma4-assistant
drafter + runs MTP — drafted 28 / accepted 12 (43%), coherent output, 112 tok/s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

lalalune added a commit to elizaOS/eliza that referenced this pull request Jun 24, 2026
… fused engine (#9273)

Follow-up to the gemma4-assistant arch bump: the fused engine (libelizainference,
the actual eliza runtime) built its MTP draft context without ctx_other, so a
gemma4-assistant drafter hard-failed there ("requires ctx_other"). Fixed in the
fork (elizaOS/llama.cpp#32) — set cp.ctx_other = ctx_tgt before init.

Validated end-to-end via the fused FFI on M4 Max Metal: the fused engine loads
the amaranus gemma4-assistant drafter + runs separate-drafter MTP (drafted 28 /
accepted 12 = 43%, coherent output, 112 tok/s). This is the runtime path the
plugin uses — so the Gemma-4 MTP drafter now works in eliza, not just llama-cli.

Co-authored-by: Shaw <shawgotbags@gmail.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

…ch IFGO

The fused pyannote-segmentation-3.0 diarizer read the BiLSTM gates as PyTorch
I,F,G,O while the GGUF actually packs them in ONNX I,O,F,C order. The scrambled
forget/output/cell gates made the diarizer over-detect overlap and hallucinate
speakers on inputs near the decision boundary (single-speaker golden-stt → 3
speakers, DER 0.79); it slipped past the small parity-fixture suite because
those inputs argmax to the right label anyway.

Verified against the reference ONNX (onnx-community/pyannote-segmentation-3.0):
with IOFC the C LSTM-0 output now matches the reference to float precision, and
golden-stt diarizes to 1 speaker / 0 overlap (DER ~0), agreeing with the
reference across single-speaker, silence, and mixed windows.
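
To make the gate mix-up concrete, here is a small sketch of the two packing orders. The helper names are hypothetical and this is not the diarizer's code; it only encodes the I,F,G,O (PyTorch) and I,O,F,C (ONNX) block orders named above, treating PyTorch's G and ONNX's C as the same cell gate.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

enum class gate { input, forget, cell, output };

// Block order inside a packed [4*H] gate tensor.
constexpr std::array<gate, 4> ORDER_PYTORCH = { gate::input, gate::forget, gate::cell,   gate::output };
constexpr std::array<gate, 4> ORDER_ONNX    = { gate::input, gate::output, gate::forget, gate::cell   };

// Offset of gate g's block of H values under the given packing order.
inline size_t gate_offset(const std::array<gate, 4> & order, gate g, size_t H) {
    for (size_t k = 0; k < order.size(); ++k) {
        if (order[k] == g) {
            return k * H;
        }
    }
    assert(false && "unknown gate");
    return 0;
}

// Which gate's data is really read when a tensor packed as `packed`
// is indexed as if it were laid out as `assumed`.
inline gate gate_actually_read(const std::array<gate, 4> & packed,
                               const std::array<gate, 4> & assumed,
                               gate wanted) {
    for (size_t k = 0; k < assumed.size(); ++k) {
        if (assumed[k] == wanted) {
            return packed[k];
        }
    }
    assert(false && "unknown gate");
    return wanted;
}
```

Under these orders the input gate sits in block 0 either way, while a reader assuming PyTorch order gets the output block for forget, the forget block for the cell gate, and the cell block for output.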

Refs elizaOS/eliza#9460.