
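Detailed Model Architecture Comparison

This document lists every operation other than the Linear layers in five models (Qwen3.5 MoE, Llama4, GLM4, Gemma, and GPT-OSS), including activation functions, normalization layers, and positional encodings.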

1. Qwen3.5 MoE

Text Model

| Operation type | Implementation | Where it is used |
| --- | --- | --- |
| Normalization | `Qwen3_5MoeRMSNorm` (1-centered, scales by `1 + weight`) | • Pre-norm of each decoder layer (`input_layernorm`) • Post-attention norm (`post_attention_layernorm`) • QK norm (`q_norm`, `k_norm`) |
| Normalization | `Qwen3_5MoeRMSNormGated` (gated RMSNorm) | • Output norm of the linear-attention layers (combined with a SiLU gate) |
| Activation | `SiLU` (via `ACT2FN[config.hidden_act]`) | • Gate branch of the MLP • Gate branch of the shared expert • Output of the linear-attention Conv1d |
| Activation | `Sigmoid` | • Gate on the attention output (`attn_output * sigmoid(gate)`) • Weight gate of the shared expert • beta parameter of linear attention |
| Activation | `Softmax` | • Attention weights • Router logit normalization |
| Positional encoding | `Qwen3_5MoeTextRotaryEmbedding` (RoPE) | • QK position encoding in the full-attention layers |
| Convolution | `nn.Conv1d` (groups=conv_dim, kernel_size=4) | • Temporal convolution in linear attention (Gated Delta Net) |
| Routing | `Qwen3_5MoeTopKRouter` (Linear + Softmax + TopK) | • MoE expert selection |
| Other | L2 norm (`l2norm`) | • QK normalization in linear attention (optional) |
| Other | Gated Delta Rule (chunk/recurrent) | • Core attention mechanism of linear attention |
| Other | Cumulative sum (`cumsum`) | • Decay computation in linear attention |
| Other | Exponential (`exp`) | • Gated decay in linear attention |
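
As a rough illustration of the gated RMSNorm row above, here is a minimal pure-Python sketch. The order of operations (RMS-normalize, scale by the weight, then multiply by SiLU of the gate) and the names `rms_norm_gated`, `gate`, and `eps` are assumptions made for illustration, not a copy of `Qwen3_5MoeRMSNormGated`.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def rms_norm_gated(x, gate, weight, eps=1e-6):
    """RMS-normalize x, scale by weight, then multiply by SiLU(gate).

    Assumed ordering for illustration; check the model source for the
    exact placement of the gate.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms * silu(g) for v, g, w in zip(x, gate, weight)]


# A gate of 0 suppresses its channel because SiLU(0) == 0.
out = rms_norm_gated([3.0, 4.0], gate=[0.0, 10.0], weight=[1.0, 1.0])
print(out)
```

The gate acts per channel, so the layer can both normalize the linear-attention output and switch individual channels off.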

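Vision Model

| Operation type | Implementation | Where it is used |
| --- | --- | --- |
| Normalization | `nn.LayerNorm` (eps=1e-6) | • Before and after attention in each vision block (`norm1`, `norm2`) • Patch Merger normalization |
| Activation | `GELU (PyTorch tanh)` | • Vision MLP • Patch Merger |
| Positional encoding | `Qwen3_5MoeVisionRotaryEmbedding` (2D RoPE) | • QK position encoding in vision attention |
| Convolution | `nn.Conv3d` (3D convolution) | • Patch embedding (spatio-temporal patching) |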

2. Llama4

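Text Model

| Operation type | Implementation | Where it is used |
| --- | --- | --- |
| Normalization | `Llama4TextRMSNorm` | • Pre-norm of each decoder layer (`input_layernorm`) • Post-attention norm (`post_attention_layernorm`) |
| Normalization | `Llama4TextL2Norm` | • QK norm (`qk_norm`, only on layers that use RoPE and only when `use_qk_norm=True`) |
| Activation | `SiLU` (via `ACT2FN[config.hidden_act]`) | • Gate branch of the MLP • Gate branch of the shared expert • Gate branch of the MoE experts |
| Activation | `Sigmoid` | • Router sigmoid (produces the expert weights) |
| Activation | `Softmax` | • Attention weights |
| Positional encoding | `Llama4TextRotaryEmbedding` (RoPE, complex representation `freqs_cis`) | • RoPE on a subset of layers (controlled by `no_rope_layers`) |
| Positional encoding | Temperature tuning (logarithmic scaling) | • Query scaling on NoRoPE layers (`attn_temperature_tuning`) |
| Routing | `Llama4Router` (Linear + TopK + Sigmoid) | • MoE expert selection |
| Other | Scatter (`scatter_`) | • Router logit processing |
| Other | Repeat + Sum | • Aggregation of MoE expert outputs |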
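
The router row above (Linear + TopK + Sigmoid, with a scatter step) can be sketched roughly as follows. This is a minimal pure-Python illustration that starts from already-computed router logits; the function name `llama4_style_route` and the choice to fill unselected experts with `-inf` before the sigmoid are assumptions, not the exact `Llama4Router` code.

```python
import math


def llama4_style_route(logits, top_k):
    """Return one weight per expert: sigmoid on the top-k logits, 0 elsewhere."""
    # Indices of the k largest router logits.
    top_idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]

    # "Scatter" the selected logits into a buffer pre-filled with -inf.
    scattered = [float("-inf")] * len(logits)
    for i in top_idx:
        scattered[i] = logits[i]

    # sigmoid(-inf) is 0, so unselected experts contribute nothing.
    return [
        0.0 if s == float("-inf") else 1.0 / (1.0 + math.exp(-s))
        for s in scattered
    ]


weights = llama4_style_route([2.0, -1.0, 0.0, 0.5], top_k=2)
print(weights)
```

Because the sigmoid is applied per expert, the selected weights do not have to sum to 1, unlike a softmax-based router.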

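Vision Model

| Operation type | Implementation | Where it is used |
| --- | --- | --- |
| Normalization | `nn.LayerNorm` | • Before and after attention in each vision encoder layer |
| Activation | `nn.GELU()` | • Vision MLP • Vision MLP2 (projector) • Patch Merger |
| Positional encoding | `Llama4VisionRotaryEmbedding` (2D RoPE, complex representation) | • QK position encoding in vision attention |
| Other | `Dropout` (regularization, not an activation) | • Vision MLP2 (projector) |
| Other | Pixel Shuffle | • Spatial rearrangement in the vision projector |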

3. GLM4

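| Operation type | Implementation | Where it is used |
| --- | --- | --- |
| Normalization | `Glm4RMSNorm` | • Four norms per decoder layer: (1) `input_layernorm` (before attention) (2) `post_self_attn_layernorm` (after attention, before the residual add) (3) `post_attention_layernorm` (after the first residual add, before the MLP) (4) `post_mlp_layernorm` (after the MLP, before the residual add) |
| Activation | `SiLU` (via `ACT2FN[config.hidden_act]`) | • Gate branch of the MLP |
| Activation | `Softmax` | • Attention weights |
| Positional encoding | `Glm4RotaryEmbedding` (RoPE) | • QK position encoding in attention |
| Other | `repeat_kv` (KV repetition for GQA) | • Expanding the KV heads in attention |
| Other | `rotate_half` | • Rotation step of RoPE |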
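
To make the four-norm ordering in the table concrete, here is a small pure-Python sketch of one decoder layer. The function `glm4_style_block` and the `traced` helper are hypothetical stand-ins that operate on lists of floats; only the ordering of the norms relative to attention, the MLP, and the two residual adds is the point.

```python
def glm4_style_block(x, attn, mlp, norms):
    """Apply attention and MLP with a norm before and after each sub-block."""
    residual = x
    h = norms["input_layernorm"](x)
    h = attn(h)
    h = norms["post_self_attn_layernorm"](h)
    h = [r + v for r, v in zip(residual, h)]  # first residual add

    residual = h
    h = norms["post_attention_layernorm"](h)
    h = mlp(h)
    h = norms["post_mlp_layernorm"](h)
    return [r + v for r, v in zip(residual, h)]  # second residual add


# Hypothetical stand-ins: record the call order and return the input unchanged.
calls = []


def traced(name):
    def op(values):
        calls.append(name)
        return values
    return op


norm_names = [
    "input_layernorm",
    "post_self_attn_layernorm",
    "post_attention_layernorm",
    "post_mlp_layernorm",
]
norms = {name: traced(name) for name in norm_names}
out = glm4_style_block([1.0, 2.0], traced("attn"), traced("mlp"), norms)
print(calls)
```

Each sub-block is wrapped by a norm on both sides, which is what distinguishes this layout from the two-norm pre-norm layers used by the other models.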

4. Gemma

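| Operation type | Implementation | Where it is used |
| --- | --- | --- |
| Normalization | `GemmaRMSNorm` (1-centered, scales by `1 + weight`, weight initialized to 0) | • Pre-norm of each decoder layer (`input_layernorm`) • Post-attention norm (`post_attention_layernorm`) |
| Activation | `GELU (PyTorch tanh)` (via `ACT2FN[config.hidden_act]`) | • Gate branch of the MLP |
| Activation | `Softmax` | • Attention weights |
| Positional encoding | `GemmaRotaryEmbedding` (RoPE) | • QK position encoding in attention |
| Other | `repeat_kv` (KV repetition for GQA) | • Expanding the KV heads in attention |
| Other | `rotate_half` | • Rotation step of RoPE |
| Other | Interleave (`repeat_interleave`) | • Interleaved repetition of the RoPE cos/sin values |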
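
The "GELU (PyTorch tanh)" entry refers to the tanh approximation of GELU. The sketch below compares it with the exact erf-based form and shows the gated-MLP core where the activation is applied to the gate branch only. The helper name `gated_mlp_core` is hypothetical and the Linear projections are left out.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gated_mlp_core(gate, up):
    """Element-wise core of a gated MLP: act(gate) * up."""
    return [gelu_tanh(g) * u for g, u in zip(gate, up)]


for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    print(x, gelu_exact(x), gelu_tanh(x))
```

The two forms agree closely over the usual activation range, so the approximation mainly matters when matching checkpoint numerics exactly.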

5. GPT-OSS

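| Operation type | Implementation | Where it is used |
| --- | --- | --- |
| Normalization | `GptOssRMSNorm` (inherits from `LlamaRMSNorm`) | • Pre-norm of each decoder layer (`input_layernorm`) • Post-attention norm (`post_attention_layernorm`) |
| Activation | Custom gated SiLU (`gate * sigmoid(gate * 1.702)`) | • Gated activation of the MoE experts |
| Activation | `Clamp` (limit 7.0) | • Limits the gate and up branches of the MoE experts (gate capped at +7.0, up clamped to ±7.0) |
| Activation | `Sigmoid` | • Part of the custom gated activation |
| Activation | `Softmax` | • Attention weights • Router logit normalization |
| Positional encoding | `GptOssRotaryEmbedding` (RoPE) | • QK position encoding in attention |
| Routing | `GptOssTopKRouter` (Linear + TopK + Softmax) | • MoE expert selection |
| Other | Sinks (learnable `nn.Parameter`) | • Auxiliary attention parameter (`s_aux`) |
| Other | Sliding window attention | • Used on a subset of layers (`layer_type="sliding_attention"`) |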
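
The custom activation and the clamp rows fit together roughly as in the scalar sketch below. The formula `gate * sigmoid(gate * 1.702)` and the limit 7.0 come from the table; the one-sided clamp on the gate and the final `(up + 1)` combination are assumptions based on my reading of the reference implementation and should be checked against the model source. The function name `gpt_oss_style_glu` is hypothetical.

```python
import math

ALPHA = 1.702  # coefficient from the table
LIMIT = 7.0    # clamp limit from the table


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gpt_oss_style_glu(gate: float, up: float) -> float:
    """Clamp both branches, then apply the gated activation."""
    gate = min(gate, LIMIT)             # assumption: gate is only capped from above
    up = max(-LIMIT, min(up, LIMIT))    # up is clamped on both sides
    glu = gate * sigmoid(ALPHA * gate)  # gate * sigmoid(gate * 1.702)
    return (up + 1.0) * glu             # assumption: combined as (up + 1) * glu


print(gpt_oss_style_glu(0.5, 2.0))
print(gpt_oss_style_glu(100.0, 100.0))  # both inputs saturate at the limit
```

The clamp bounds the expert output before the down projection, which keeps outlier activations from dominating the MoE sum.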

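Summary of Key Differences

Number of normalization layers

GLM4 has the most: four RMSNorm layers per decoder layer.
Qwen3.5 adds a special RMSNormGated (combined with a SiLU gate).
Gemma's RMSNorm is 1-centered: the weight is initialized to 0 and applied as `1 + weight`. Qwen3.5's RMSNorm also uses the `1 + weight` form.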
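
A short sketch of what "1-centered" means in practice: a standard RMSNorm with weights at 1 and a 1-centered RMSNorm with weights at 0 produce the same output. The function `rms_norm` and its `one_centered` flag are illustrative names, not part of any of the model classes.

```python
import math


def rms_norm(x, weight, eps=1e-6, one_centered=False):
    """RMSNorm over a list of floats.

    one_centered=False: output = weight * x / rms(x)
    one_centered=True:  output = (1 + weight) * x / rms(x)
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    scale = [1.0 + w for w in weight] if one_centered else list(weight)
    return [s * v * inv_rms for s, v in zip(scale, x)]


x = [3.0, 4.0]
standard = rms_norm(x, weight=[1.0, 1.0])
centered = rms_norm(x, weight=[0.0, 0.0], one_centered=True)
print(standard, centered)
```

The two parameterizations only differ in where the stored weight sits, which matters for weight decay and for converting checkpoints between implementations.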

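Activation functions

GPT-OSS uses a custom gated SiLU (with clamping and the special coefficient 1.702).
The text models of Qwen3.5, GLM4, and Llama4 use SiLU.
Gemma's MLP and the vision models of Llama4 and Qwen3.5 use GELU.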

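Positional encoding

All five models use RoPE.
Llama4 skips RoPE on some layers and applies temperature tuning to the queries there instead.
Llama4 uses a complex-number representation (`freqs_cis`); the others keep separate cos/sin tensors.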
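
The complex and cos/sin forms of RoPE describe the same rotation. The sketch below rotates a single (a, b) pair both ways and checks that the results match. The angle schedule in `rope_angles` uses the common base of 10000, which is an assumption here and not taken from any of the five configs. The sketch also ignores how each implementation pairs up the channels of a head.

```python
import cmath
import math


def rotate_pair_complex(a, b, theta):
    """Rotate (a, b) by theta using complex multiplication."""
    z = complex(a, b) * cmath.exp(1j * theta)
    return z.real, z.imag


def rotate_pair_cos_sin(a, b, theta):
    """Rotate (a, b) by theta using separate cos and sin values."""
    c, s = math.cos(theta), math.sin(theta)
    return a * c - b * s, a * s + b * c


def rope_angles(position, head_dim, base=10000.0):
    """One angle per channel pair: position * base^(-2i / head_dim)."""
    return [position * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


for theta in rope_angles(position=5, head_dim=8):
    print(rotate_pair_complex(1.0, 2.0, theta), rotate_pair_cos_sin(1.0, 2.0, theta))
```

Which channels form a pair (adjacent or half a head apart) differs between implementations, so weights are not interchangeable without a permutation even though the math is the same.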

Special operations

Qwen3.5: linear attention (Gated Delta Net + Conv1d)
Llama4: Pixel Shuffle, temperature tuning, L2 norm (QK)
GPT-OSS: sink parameters, sliding window attention
GLM4: a distinctive architecture with four normalization layers per decoder layer
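
As one example of these special operations, the attention sink can be pictured as an extra learnable logit that joins the softmax and is then dropped. The sketch below is my understanding of the general idea for a single query row; the function name `softmax_with_sink` is hypothetical and this is not the GPT-OSS attention code.

```python
import math


def softmax_with_sink(scores, sink):
    """Softmax over the scores plus one sink logit; the sink's share is discarded."""
    m = max(max(scores), sink)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)
    return [e / denom for e in exps]


probs = softmax_with_sink([0.0, 0.0], sink=0.0)
print(probs, sum(probs))  # the attention weights sum to less than 1

no_sink = softmax_with_sink([1.0, 2.0], sink=float("-inf"))
print(no_sink, sum(no_sink))  # a sink of -inf recovers the plain softmax
```

The sink gives a head somewhere to put probability mass when no key is relevant, instead of forcing the weights over real tokens to sum to 1.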

MoE routing

Qwen3.5: Softmax + TopK + shared expert (Sigmoid gate)
Llama4: TopK + Sigmoid + shared expert
GPT-OSS: TopK + Softmax
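
The order of Softmax and TopK changes the resulting expert weights. The sketch below contrasts the two orderings listed above on the same logits. The function names are hypothetical, and any renormalization a real router may apply after selection is left out.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(xs, k):
    return sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)[:k]


def softmax_then_topk(logits, k):
    """Softmax over all experts, then keep the k largest probabilities."""
    probs = softmax(logits)
    return {i: probs[i] for i in top_k_indices(probs, k)}


def topk_then_softmax(logits, k):
    """Keep the k largest logits, then softmax over only those."""
    idx = top_k_indices(logits, k)
    return dict(zip(idx, softmax([logits[i] for i in idx])))


logits = [2.0, 1.0, 0.0, -1.0]
a = softmax_then_topk(logits, k=2)
b = topk_then_softmax(logits, k=2)
print(a, sum(a.values()))
print(b, sum(b.values()))
```

Both orderings select the same experts, but only the second one yields weights that sum to 1 without an extra normalization step.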