Disclosure: This issue was drafted and submitted by an AI assistant (Claude) on behalf of the repository owner. The technical content is based on analysis of public MIT-licensed llama.cpp source code and the author's working Gemma 4 + TurboQuant CUDA implementation.
Hey — I've been following your Gemma 4 work over the last few days and noticed you're hitting some of the same walls I went through. I have Gemma 4 running with TurboQuant KV cache compression on an NVIDIA RTX 4090 via a llama.cpp-based fork, so the architectural issues are fresh.
I'd submit patches directly, but your contribution guidelines exclude AI-generated code, and my implementation work was done with AI assistance, so I'll stick to pointing you at the right reference material instead. Everything below references the public MIT-licensed llama.cpp source.
1. layer_output_scale — the #1 divergence source
I see you've iterated on this 4+ times in the last 48 hours. The correct behavior from `src/models/gemma4-iswa.cpp` (ggml-org/llama.cpp):
```cpp
// Applied AFTER all residual connections AND per-layer embedding (PLE)
// It's a simple elementwise multiply on the FULL accumulated tensor
cur = ggml_mul(ctx0, cur, model.layers[il].out_scale);
```
Key: it's applied to the entire accumulated hidden state (residual included), not to the layer's delta contribution. The "residual-separation" formula (`x_old + scale * (x_current - x_old)`) that I see in your history is incorrect; the model was trained with scaling applied to the full tensor. The values are small (e.g., 0.0178 for layer 0) but that's by design; each subsequent layer compensates.
Order of operations (this matters):
- Attention + attn_post_norm + residual add
- FFN (MoE or dense) + ffn_post_norm + residual add
- Per-layer embedding (PLE) + residual add
- Then `layer_output_scale` on the result
- That becomes `inpL` for the next layer
2. V-norm — RMSNorm on V before KV cache
Gemma 4 applies RMSNorm to the V projection output before storing to KV cache:
```cpp
Vcur = ggml_rms_norm(ctx0, Vcur, hparams.f_norm_rms_eps);
```
This is unusual — most architectures only norm K (for Q-K dot product stability). Gemma 4 norms both K and V. If you're norming V after cache retrieval or not at all, that's a source of divergence.
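For reference, a minimal standalone RMSNorm in plain C++, matching what `ggml_rms_norm` computes over the normalized dimension (the default eps value is illustrative; the real value comes from `hparams.f_norm_rms_eps`):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// RMSNorm: x / sqrt(mean(x^2) + eps). Applied to V before the KV-cache
// store, i.e. the cache holds already-normed V.
std::vector<float> rms_norm(const std::vector<float>& x, float eps = 1e-6f) {
    float ss = 0.0f;
    for (float v : x) ss += v * v;
    const float scale = 1.0f / std::sqrt(ss / x.size() + eps);
    std::vector<float> out(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        out[i] = x[i] * scale;
    }
    return out;
}
```

Note there is no learned weight in this sketch; if the checkpoint carries a V-norm weight tensor, it would be an elementwise multiply after this.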
3. MoE router — non-standard logit calculation
The Gemma 4 expert router doesn't just project the hidden state. It:
- Takes `attn_out` (the post-attention residual, NOT the post-FFN-norm tensor)
- Applies RMSNorm
- Scales by `1.0 / sqrt(n_embd)`
- Multiplies by a learned `ffn_gate_inp_s` scale tensor
- Then projects through `ffn_gate_inp` to get expert logits
If your router operates on the FFN-normed tensor instead of `attn_out`, or misses the `1/sqrt(n_embd)` scaling, expert routing will be wrong and outputs will diverge even if individual expert FFNs are correct.
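The router steps above, sketched in plain C++. The tensor names mirror the ones mentioned (`ffn_gate_inp_s`, `ffn_gate_inp`); the row-major `[n_expert][n_embd]` layout for the projection is an assumption of this sketch, not taken from the reference code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Router logits: RMSNorm(attn_out) * (1/sqrt(n_embd)) * gate_inp_s,
// then project through gate_inp to get one logit per expert.
std::vector<float> router_logits(const std::vector<float>& attn_out,
                                 const std::vector<float>& gate_inp_s,
                                 const std::vector<std::vector<float>>& gate_inp,
                                 float eps = 1e-6f) {
    const size_t n_embd = attn_out.size();
    float ss = 0.0f;
    for (float v : attn_out) ss += v * v;
    const float rms = 1.0f / std::sqrt(ss / n_embd + eps); // RMSNorm factor
    const float dim = 1.0f / std::sqrt((float) n_embd);    // 1/sqrt(n_embd)
    std::vector<float> h(n_embd);
    for (size_t i = 0; i < n_embd; ++i) {
        h[i] = attn_out[i] * rms * dim * gate_inp_s[i];    // learned scale
    }
    std::vector<float> logits(gate_inp.size(), 0.0f);
    for (size_t e = 0; e < gate_inp.size(); ++e) {
        for (size_t i = 0; i < n_embd; ++i) {
            logits[e] += gate_inp[e][i] * h[i];            // expert logit
        }
    }
    return logits;
}
```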
4. Proportional RoPE + dual head dimensions
Gemma 4 has different `head_dim` for full-attention vs sliding-window layers (e.g., 256 vs 128, or 512 vs 256). The RoPE dimensions and frequency base must switch per layer based on the layer type. Full-attention layers use learned `rope_freqs` (proportional RoPE) while sliding layers use computed frequencies from `rope.freq_base_swa`.
From the GGUF metadata: check that you're reading `rope.freq_base_swa` (or `rope.local.freq_base` as fallback) for the sliding-window layers, and using the per-layer `rope_freqs` tensor for full-attention layers.
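A tiny sketch of the per-layer switch, just to pin down which parameters travel together (the struct and function names are mine, and the example head dims / freq bases are illustrative, not read from any real Gemma 4 GGUF):

```cpp
// Per-layer RoPE configuration selected by attention type.
struct RopeParams {
    int   head_dim;
    float freq_base;
    bool  use_learned_freqs; // per-layer rope_freqs tensor (proportional RoPE)
};

RopeParams rope_params_for_layer(bool is_sliding,
                                 float freq_base_full, float freq_base_swa,
                                 int head_dim_full, int head_dim_swa) {
    if (is_sliding) {
        // sliding-window layer: computed frequencies from rope.freq_base_swa
        return { head_dim_swa, freq_base_swa, false };
    }
    // full-attention layer: learned rope_freqs, larger head_dim
    return { head_dim_full, freq_base_full, true };
}
```

Getting either half of the pair (head_dim or freq source) from the wrong layer type produces divergence only on some layers, which can be hard to localize.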
5. Attention softcap
Gemma 2/3 use `attn_logit_softcapping = 50.0`. Gemma 4 does NOT. I see you have the config flag right (`is_gemma4 && attn_logit_softcap == 0.0f`) but worth double-checking it's not being applied somewhere in the attention computation path.
6. KV sharing
`attention.shared_kv_layers` — the last N layers reuse K/V projections from earlier same-type layers (same sliding/full classification). The reference layer lookup walks backward to find the most recent layer of the same attention type.
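The backward walk is simple enough to write down directly. A sketch, where `is_sliding[i]` is a hypothetical per-layer flag for the sliding/full classification:

```cpp
#include <vector>

// For a layer in the shared-KV tail, find the most recent earlier layer
// with the same attention type; its K/V projections are reused.
// Returns -1 if no earlier same-type layer exists.
int kv_reference_layer(const std::vector<bool>& is_sliding, int il) {
    for (int j = il - 1; j >= 0; --j) {
        if (is_sliding[j] == is_sliding[il]) {
            return j; // reuse this layer's K/V
        }
    }
    return -1;
}
```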
The file to study is `src/models/gemma4-iswa.cpp` in the ggml-org/llama.cpp repo — it's ~310 lines and covers the complete forward pass. For the model loading side (weight names, GGUF keys), see `src/llama-model.cpp`, searching for `LLM_ARCH_GEMMA4`.
Happy to answer questions. Great project — looking forward to seeing Gemma 4 run on your CPU/WASM path.