
Gemma 4 forward pass reference points from a working CUDA implementation #96

@Mutdogus


Disclosure: This issue was drafted and submitted by an AI assistant (Claude) on behalf of the repository owner. The technical content is based on analysis of public MIT-licensed llama.cpp source code and the author's working Gemma 4 + TurboQuant CUDA implementation.

Hey — I've been following your Gemma 4 work over the last few days and noticed you're hitting some of the same walls I went through. I have Gemma 4 running with TurboQuant KV cache compression on an NVIDIA RTX 4090 via a llama.cpp-based fork, so the architectural issues are fresh.

I'd submit patches directly, but your contribution guidelines exclude AI-generated code, and my implementation work was done with AI assistance, so I'll stick to pointing you at the right reference material instead. Everything below references the public MIT-licensed llama.cpp source.

1. layer_output_scale — the #1 divergence source

I see you've iterated on this 4+ times in the last 48 hours. The correct behavior from src/models/gemma4-iswa.cpp (ggml-org/llama.cpp):

// Applied AFTER all residual connections AND per-layer embedding (PLE)
// It's a simple elementwise multiply on the FULL accumulated tensor
cur = ggml_mul(ctx0, cur, model.layers[il].out_scale);

Key: it's applied to the entire accumulated hidden state (residual included), not to the layer's delta contribution. The "residual-separation" formula (x_old + scale * (x_current - x_old)) that I see in your history is incorrect — the model was trained with scaling applied to the full tensor. The values are small (e.g., 0.0178 for layer 0) but that's by design; each subsequent layer compensates.

Order of operations (this matters):

  1. Attention + attn_post_norm + residual add
  2. FFN (MoE or dense) + ffn_post_norm + residual add
  3. Per-layer embedding (PLE) + residual add
  4. Then layer_output_scale on the result
  5. That becomes inpL for the next layer

2. V-norm — RMSNorm on V before KV cache

Gemma 4 applies RMSNorm to the V projection output before storing to KV cache:

Vcur = ggml_rms_norm(ctx0, Vcur, hparams.f_norm_rms_eps);

This is unusual — most architectures only norm K (for Q-K dot product stability). Gemma 4 norms both K and V. If you're norming V after cache retrieval or not at all, that's a source of divergence.

3. MoE router — non-standard logit calculation

The Gemma 4 expert router doesn't just project the hidden state. It:

  1. Takes attn_out (the post-attention residual, NOT the post-FFN-norm tensor)
  2. Applies RMSNorm
  3. Scales by 1.0 / sqrt(n_embd)
  4. Multiplies by a learned ffn_gate_inp_s scale tensor
  5. Then projects through ffn_gate_inp to get expert logits

If your router operates on the FFN-normed tensor instead of attn_out, or misses the 1/sqrt(n_embd) scaling, expert routing will be wrong and outputs will diverge even if individual expert FFNs are correct.

4. Proportional RoPE + dual head dimensions

Gemma 4 has different head_dim for full-attention vs sliding-window layers (e.g., 256 vs 128, or 512 vs 256). The RoPE dimensions and frequency base must switch per layer based on the layer type. Full-attention layers use learned rope_freqs (proportional RoPE) while sliding layers use computed frequencies from rope.freq_base_swa.

From the GGUF metadata: check that you're reading rope.freq_base_swa (or rope.local.freq_base as fallback) for the sliding-window layers, and using the per-layer rope_freqs tensor for full-attention layers.

5. Attention softcap

Gemma 2/3 use attn_logit_softcapping = 50.0. Gemma 4 does NOT. I see you have the config flag right (is_gemma4 && attn_logit_softcap == 0.0f), but it's worth double-checking that the cap isn't being applied somewhere else in the attention computation path.

6. KV sharing

attention.shared_kv_layers — the last N layers reuse K/V projections from earlier same-type layers (same sliding/full classification). The reference layer lookup walks backward to find the most recent layer of the same attention type.


The file to study is src/models/gemma4-iswa.cpp in the ggml-org/llama.cpp repo — it's ~310 lines and covers the complete forward pass. For the model loading side (weight names, GGUF keys), see src/llama-model.cpp searching for LLM_ARCH_GEMMA4.

Happy to answer questions. Great project — looking forward to seeing Gemma 4 run on your CPU/WASM path.
