Disclosure: This issue was drafted and submitted by an AI assistant (Claude) on behalf of the repository owner. The technical content is based on analysis of public MIT-licensed llama.cpp source code and the author's working Gemma 4 + TurboQuant CUDA implementation.
Hey — I've been following your Gemma 4 work over the last few days and noticed you're hitting some of the same walls I went through. I have Gemma 4 running with TurboQuant KV cache compression on an NVIDIA RTX 4090 via a llama.cpp-based fork, so the architectural issues are fresh.
I'd submit patches directly, but your contribution guidelines exclude AI-generated code, and my implementation work was done with AI assistance, so I'll stick to pointing you at the right reference material instead. Everything below references the public MIT-licensed llama.cpp source.
1. layer_output_scale — the #1 divergence source
I see you've iterated on this 4+ times in the last 48 hours. The correct behavior from `src/models/gemma4-iswa.cpp` (ggml-org/llama.cpp):
```cpp
// Applied AFTER all residual connections AND per-layer embedding (PLE)
// It's a simple elementwise multiply on the FULL accumulated tensor
cur = ggml_mul(ctx0, cur, model.layers[il].out_scale);
```
Key: it's applied to the entire accumulated hidden state (residual included), not to the layer's delta contribution. The "residual-separation" formula (`x_old + scale * (x_current - x_old)`) that I see in your history is incorrect; the model was trained with scaling applied to the full tensor. The values are small (e.g., 0.0178 for layer 0) but that's by design; each subsequent layer compensates.
Order of operations (this matters):
- Attention + attn_post_norm + residual add
- FFN (MoE or dense) + ffn_post_norm + residual add
- Per-layer embedding (PLE) + residual add
- Then `layer_output_scale` on the result
- That becomes `inpL` for the next layer
2. V-norm — RMSNorm on V before KV cache
Gemma 4 applies RMSNorm to the V projection output before storing to KV cache:
```cpp
Vcur = ggml_rms_norm(ctx0, Vcur, hparams.f_norm_rms_eps);
```
This is unusual — most architectures only norm K (for Q-K dot product stability). Gemma 4 norms both K and V. If you're norming V after cache retrieval or not at all, that's a source of divergence.
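For reference, a minimal standalone RMSNorm in plain C++, matching what `ggml_rms_norm` computes over the normalized dimension (the default eps value is illustrative; the real value comes from `hparams.f_norm_rms_eps`):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// RMSNorm: x / sqrt(mean(x^2) + eps). Applied to V before the KV-cache
// store, i.e. the cache holds already-normed V.
std::vector<float> rms_norm(const std::vector<float>& x, float eps = 1e-6f) {
    float ss = 0.0f;
    for (float v : x) ss += v * v;
    const float scale = 1.0f / std::sqrt(ss / x.size() + eps);
    std::vector<float> out(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        out[i] = x[i] * scale;
    }
    return out;
}
```

Note there is no learned weight in this sketch; if the checkpoint carries a V-norm weight tensor, it would be an elementwise multiply after this.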
3. MoE router — non-standard logit calculation
The Gemma 4 expert router doesn't just project the hidden state. It:
- Takes `attn_out` (the post-attention residual, NOT the post-FFN-norm tensor)
- Applies RMSNorm
- Scales by `1.0 / sqrt(n_embd)`
- Multiplies by a learned `ffn_gate_inp_s` scale tensor
- Then projects through `ffn_gate_inp` to get expert logits
If your router operates on the FFN-normed tensor instead of `attn_out`, or misses the `1/sqrt(n_embd)` scaling, expert routing will be wrong and outputs will diverge even if individual expert FFNs are correct.
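The router steps above, sketched in plain C++. The tensor names mirror the ones mentioned (`ffn_gate_inp_s`, `ffn_gate_inp`); the row-major `[n_expert][n_embd]` layout for the projection is an assumption of this sketch, not taken from the reference code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Router logits: RMSNorm(attn_out) * (1/sqrt(n_embd)) * gate_inp_s,
// then project through gate_inp to get one logit per expert.
std::vector<float> router_logits(const std::vector<float>& attn_out,
                                 const std::vector<float>& gate_inp_s,
                                 const std::vector<std::vector<float>>& gate_inp,
                                 float eps = 1e-6f) {
    const size_t n_embd = attn_out.size();
    float ss = 0.0f;
    for (float v : attn_out) ss += v * v;
    const float rms = 1.0f / std::sqrt(ss / n_embd + eps); // RMSNorm factor
    const float dim = 1.0f / std::sqrt((float) n_embd);    // 1/sqrt(n_embd)
    std::vector<float> h(n_embd);
    for (size_t i = 0; i < n_embd; ++i) {
        h[i] = attn_out[i] * rms * dim * gate_inp_s[i];    // learned scale
    }
    std::vector<float> logits(gate_inp.size(), 0.0f);
    for (size_t e = 0; e < gate_inp.size(); ++e) {
        for (size_t i = 0; i < n_embd; ++i) {
            logits[e] += gate_inp[e][i] * h[i];            // expert logit
        }
    }
    return logits;
}
```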
4. Proportional RoPE + dual head dimensions
Gemma 4 has different `head_dim` for full-attention vs sliding-window layers (e.g., 256 vs 128, or 512 vs 256). The RoPE dimensions and frequency base must switch per layer based on the layer type. Full-attention layers use learned `rope_freqs` (proportional RoPE) while sliding layers use computed frequencies from `rope.freq_base_swa`.
From the GGUF metadata: check that you're reading `rope.freq_base_swa` (or `rope.local.freq_base` as fallback) for the sliding-window layers, and using the per-layer `rope_freqs` tensor for full-attention layers.
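A tiny sketch of the per-layer switch, just to pin down which parameters travel together (the struct and function names are mine, and the example head dims / freq bases are illustrative, not read from any real Gemma 4 GGUF):

```cpp
// Per-layer RoPE configuration selected by attention type.
struct RopeParams {
    int   head_dim;
    float freq_base;
    bool  use_learned_freqs; // per-layer rope_freqs tensor (proportional RoPE)
};

RopeParams rope_params_for_layer(bool is_sliding,
                                 float freq_base_full, float freq_base_swa,
                                 int head_dim_full, int head_dim_swa) {
    if (is_sliding) {
        // sliding-window layer: computed frequencies from rope.freq_base_swa
        return { head_dim_swa, freq_base_swa, false };
    }
    // full-attention layer: learned rope_freqs, larger head_dim
    return { head_dim_full, freq_base_full, true };
}
```

Getting either half of the pair (head_dim or freq source) from the wrong layer type produces divergence only on some layers, which can be hard to localize.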
5. Attention softcap
Gemma 2/3 use `attn_logit_softcapping = 50.0`. Gemma 4 does NOT. I see you have the config flag right (`is_gemma4 && attn_logit_softcap == 0.0f`) but worth double-checking it's not being applied somewhere in the attention computation path.
6. KV sharing
`attention.shared_kv_layers` — the last N layers reuse K/V projections from earlier same-type layers (same sliding/full classification). The reference layer lookup walks backward to find the most recent layer of the same attention type.
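The backward walk is simple enough to write down directly. A sketch, where `is_sliding[i]` is a hypothetical per-layer flag for the sliding/full classification:

```cpp
#include <vector>

// For a layer in the shared-KV tail, find the most recent earlier layer
// with the same attention type; its K/V projections are reused.
// Returns -1 if no earlier same-type layer exists.
int kv_reference_layer(const std::vector<bool>& is_sliding, int il) {
    for (int j = il - 1; j >= 0; --j) {
        if (is_sliding[j] == is_sliding[il]) {
            return j; // reuse this layer's K/V
        }
    }
    return -1;
}
```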
The file to study is `src/models/gemma4-iswa.cpp` in the ggml-org/llama.cpp repo — it's ~310 lines and covers the complete forward pass. For the model loading side (weight names, GGUF keys), see `src/llama-model.cpp`, searching for `LLM_ARCH_GEMMA4`.
Happy to answer questions. Great project — looking forward to seeing Gemma 4 run on your CPU/WASM path.