mindspore-lab · yisunlp · Dec 2, 2025
diff --git a/2025-Ascend-Innovation-Contest/S1/MoE/efficient/patches.zip b/2025-Ascend-Innovation-Contest/S1/MoE/efficient/patches.zip
diff --git a/2025-Ascend-Innovation-Contest/S1/MoE/efficient/readme.md b/2025-Ascend-Innovation-Contest/S1/MoE/efficient/readme.md
@@ -0,0 +1,95 @@
+# 🏆 MoE Inference Optimization (1st Place Solution)
+
+<div align="center">
+
+![MindSpore](https://img.shields.io/badge/Framework-MindSpore-blue)
+![Hardware](https://img.shields.io/badge/Hardware-Ascend-red)
+![Status](https://img.shields.io/badge/Status-Champion-gold)
+
+**Ascend AI Innovation Contest, MoE Track: Champion Solution**
+
+</div>
+
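+## 📖 Introduction
+
+This project is an inference performance optimization for the **Qwen2MoE / Deepseek-MoE** family of large models, targeting the **MindSpore** framework on **Huawei Ascend** hardware.
+
+While keeping model accuracy effectively lossless (the difference in logits is negligible), it restructures the MoE computation and introduces Flash Attention, which significantly reduces Prefill/Decode latency and lowers device memory usage. This solution took **first place** in the contest.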
+
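+## 🚀 Key Optimizations
+
+### 1. MoE Block: Stacked Weights & Fully Parallel Computation
+
+This is the most important source of the speedup. The original MoE implementation typically uses a `Loop` + `Index Select` pattern, which causes serious performance bottlenecks on the NPU:
+* **Token sorting/indexing overhead**: `ops.nonzero` produces data-dependent (dynamic) shapes that carry through to the indexing and `index_add` steps, which blocks static-graph compilation optimizations.
+* **Kernel launch overhead**: Looping over the experts issues dozens of tiny matrix multiplications that cannot saturate the Cube compute units of the Ascend NPU.
+
+**Optimization strategy of this solution:**
+* **Weight stacking**: At initialization, the `Gate_proj`, `Up_proj`, and `Down_proj` weights of all experts are each stacked into a 3D tensor: `(Num_Experts, Hidden, Inter)` for `Gate_proj` and `Up_proj`, and `(Num_Experts, Inter, Hidden)` for `Down_proj`.
+* **Parallel computation (broadcasting MatMul)**: The Python `for` loop is removed. MindSpore's `MatMul` broadcasting or `BatchMatMul` computes every expert's output for the input tokens in a single pass.
+* **Weighted sum**: The expert outputs are reduced with the router scores as weights, which replaces the complex index-and-copy operations.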
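+
+The strategy above can be checked numerically with a small NumPy sketch. It is only an illustration, not the contest code: the names `w1`/`w3`/`w2` (gate/up/down), the toy sizes, and the dense `scores` matrix (zero for experts a token was not routed to) are assumptions made for this example.
+
+```python
+import numpy as np
+
+def silu(x):
+    return x / (1.0 + np.exp(-x))
+
+rng = np.random.default_rng(0)
+num_experts, top_k, tokens, hidden, inter = 4, 2, 5, 8, 16
+
+# Stacked expert weights (hypothetical names: w1/w3/w2 = gate/up/down).
+w1 = rng.standard_normal((num_experts, hidden, inter)) * 0.1
+w3 = rng.standard_normal((num_experts, hidden, inter)) * 0.1
+w2 = rng.standard_normal((num_experts, inter, hidden)) * 0.1
+x = rng.standard_normal((tokens, hidden))
+
+# Router: keep the top-k softmax scores per token in a dense
+# [tokens, num_experts] matrix; all other entries stay zero.
+logits = rng.standard_normal((tokens, num_experts))
+probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
+top_idx = np.argsort(-probs, axis=-1)[:, :top_k]
+scores = np.zeros_like(probs)
+np.put_along_axis(scores, top_idx, np.take_along_axis(probs, top_idx, -1), -1)
+
+# Reference: loop over experts, gather routed tokens, scatter-add results.
+ref = np.zeros_like(x)
+for e in range(num_experts):
+    rows = np.nonzero(scores[:, e])[0]
+    h = silu(x[rows] @ w1[e]) * (x[rows] @ w3[e])
+    ref[rows] += (h @ w2[e]) * scores[rows, e, None]
+
+# Dense variant: broadcast x against the stacked weights, then reduce.
+h = silu(np.matmul(x, w1)) * np.matmul(x, w3)      # [E, T, inter]
+out = np.matmul(h, w2)                              # [E, T, hidden]
+dense = (out * scores.T[:, :, None]).sum(axis=0)    # [T, hidden]
+
+max_abs_diff = np.abs(dense - ref).max()
+print("max |dense - loop| =", max_abs_diff)
+```
+
+Because the non-selected experts are multiplied by a zero score, the dense result matches the looped result up to floating-point rounding.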
+
+
+
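+> **Effect**: Dozens of small serial computations are merged into 3 large matrix operations, which greatly improves **MFU (Model FLOPs Utilization)**.
+
+#### Note (important): This approach evaluates every expert, which greatly increases the amount of computation. When the input shape is small, however, the gain from parallelism far outweighs the cost of the extra computation.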
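+
+To put a rough number on that trade-off, the sketch below counts the expert matmul FLOPs for sparse (top-k) and dense (all experts) evaluation. The configuration values are illustrative assumptions, not figures taken from this README.
+
+```python
+def expert_flops(tokens, hidden, inter, experts_per_token):
+    # Three matmuls (gate, up, down) per expert, 2 FLOPs per multiply-add.
+    return tokens * experts_per_token * 3 * 2 * hidden * inter
+
+# Illustrative configuration (assumed values).
+num_experts, top_k = 60, 4
+hidden, inter = 2048, 1408
+tokens = 1  # e.g. one decode step for a single sequence
+
+sparse = expert_flops(tokens, hidden, inter, top_k)
+dense = expert_flops(tokens, hidden, inter, num_experts)
+ratio = dense / sparse
+print(f"dense/sparse expert FLOP ratio: {ratio:.0f}x")
+```
+
+The ratio is simply `num_experts / top_k`, so the extra work grows linearly with the token count. That is why the dense formulation pays off mainly for small inputs.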
+
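+### 2. Attention: Integrating Flash Attention
+
+* **Original**: The standard attention computation $\mathrm{Softmax}(Q K^T / \sqrt{d}) \cdot V$ materializes a large intermediate score matrix of shape $(Batch, Heads, Seq, Seq)$, and the repeated HBM reads and writes make it IO-bound.
+* **Optimization**: Integrated the `mindspore.ops.flash_attention_score` operator.
+* **Effect**:
+    * **Memory**: Intermediate activation memory drops from $O(N^2)$ to $O(N)$ in the sequence length $N$, which allows a longer context window.
+    * **Speed**: HBM accesses are significantly reduced, which speeds up both the Prefill and Decode phases.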
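+
+The memory saving comes from processing keys and values block by block with a running (online) softmax, so the full score matrix is never stored. The NumPy sketch below illustrates that idea for a single head without masking. It is a conceptual model only and says nothing about how the `flash_attention_score` kernel is implemented internally.
+
+```python
+import numpy as np
+
+def naive_attention(q, k, v):
+    scores = q @ k.T / np.sqrt(q.shape[-1])      # full [N, N] matrix
+    scores = scores - scores.max(axis=-1, keepdims=True)
+    p = np.exp(scores)
+    return (p / p.sum(axis=-1, keepdims=True)) @ v
+
+def blockwise_attention(q, k, v, block=4):
+    n, d = q.shape
+    out = np.zeros((n, v.shape[-1]))             # running un-normalized output
+    row_max = np.full((n, 1), -np.inf)           # running row-wise maximum
+    row_sum = np.zeros((n, 1))                   # running softmax denominator
+    for start in range(0, k.shape[0], block):
+        kb, vb = k[start:start + block], v[start:start + block]
+        s = q @ kb.T / np.sqrt(d)                # only an [N, block] tile
+        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
+        scale = np.exp(row_max - new_max)        # rescale earlier statistics
+        p = np.exp(s - new_max)
+        row_sum = row_sum * scale + p.sum(axis=-1, keepdims=True)
+        out = out * scale + p @ vb
+        row_max = new_max
+    return out / row_sum
+
+rng = np.random.default_rng(0)
+q, k, v = (rng.standard_normal((10, 8)) for _ in range(3))
+print(np.abs(naive_attention(q, k, v) - blockwise_attention(q, k, v)).max())
+```
+
+Only one `[N, block]` tile plus per-row statistics is live at any time, which is where the $O(N)$ activation footprint comes from.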
+
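+### 3. Normalization (RMSNorm): Native Fused Operator
+
+* `Qwen2MoeRMSNorm` now calls `mindnlp.core.nn.functional.rms_norm`, so an optimized low-level kernel replaces the original hand-written computation and squeezes out further performance.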
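+
+For reference, the hand-written computation that a fused RMSNorm kernel replaces looks roughly like the NumPy sketch below. The function name, argument names, and `eps` default are assumptions for this example and are not copied from the patched code.
+
+```python
+import numpy as np
+
+def rms_norm_reference(x, weight, eps=1e-6):
+    # Normalize by the root mean square over the last (hidden) axis,
+    # then apply the learned per-channel scale.
+    variance = np.mean(np.square(x), axis=-1, keepdims=True)
+    return weight * (x / np.sqrt(variance + eps))
+
+x = np.array([[3.0, 4.0]])
+print(rms_norm_reference(x, np.ones(2)))
+```
+
+Written this way, each step (square, mean, rsqrt, multiply) is a separate elementwise pass over the activations. A fused kernel performs them in a single pass.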
+
+---
+
+## 📊 Code Comparison
+
+### MoE Forward Logic
+
+**Before (Original): Serial loop**
+```python
+# Pseudocode
+final_hidden_states = zeros(...)
+for expert_idx in range(num_experts):
+    # 1. Find the indices of the tokens routed to the current expert
+    idx, top_x = ops.nonzero(expert_mask[expert_idx])
+    # 2. Gather those tokens by index
+    current_state = hidden_states[top_x]
+    # 3. Compute (the routing-weight scaling is omitted here for brevity)
+    current_out = experts[expert_idx](current_state)
+    # 4. Add the results back by index
+    final_hidden_states.index_add(0, top_x, current_out)
+```
+**After (Optimized): Vectorized and parallel**
+```python
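+# Pseudocode
+# 1. Preprocessing: stack the expert weights (done once at init)
+# self.w1 / self.w3 shape: [Num_Experts, Dim_In, Dim_Inter]  (gate / up)
+# self.w2 shape:           [Num_Experts, Dim_Inter, Dim_In]  (down)
+
+# 2. Dense computation (no loop, no dynamic indexing)
+# hidden_states [Tokens, Dim_In] broadcasts against every expert
+hidden_w1 = ops.matmul(hidden_states, self.w1)  # [Num_Experts, Tokens, Dim_Inter]
+hidden_w3 = ops.matmul(hidden_states, self.w3)  # [Num_Experts, Tokens, Dim_Inter]
+expert_out = self.act(hidden_w1) * hidden_w3
+expert_out = ops.bmm(expert_out, self.w2)       # [Num_Experts, Tokens, Dim_In]
+
+# 3. Weighted fusion
+# router_scores holds the routing weights; assumed shape [Num_Experts, Tokens, 1],
+# with zeros for experts a token was not routed to
+hidden_states = (expert_out * router_scores).sum(dim=0)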
+```
+
+
+## 🏆 Competition Results
+
+The final ranking is a weighted ordering based on the improvement rates in three areas (Prefill latency, Decode latency, and peak device memory usage), measured across multiple models in the same category.
+
+| Rank | Team | Affiliation | Peak Memory Score | Prefill Latency Score | Decode Latency Score | Total Score |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| 🥇 **1** | **efficient** | **Soochow University** | **79.0107** | **3982.5968** | **767.0094** | **1609.539** |
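+
+The weights used for the total are not stated here. As a quick consistency check on the table only, the listed total agrees with the plain mean of the three component scores to the printed precision. This is an observation about the numbers above, not a statement of the official scoring rule.
+
+```python
+# Component scores copied from the results table.
+memory_score = 79.0107
+prefill_score = 3982.5968
+decode_score = 767.0094
+listed_total = 1609.539
+
+mean_score = (memory_score + prefill_score + decode_score) / 3
+print(f"mean = {mean_score:.4f}, listed total = {listed_total}")
+```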